

Objectives 


During  the  past  few  years  we  have  been  involved  in  the  development  of  new  computational  methods  for 
quantifying  similarity/dissimilarity  of  chemicals  and  applications  of  quantitative  molecular  similarity 
analysis  (QMSA)  techniques  in  analog  selection  and  property  estimation  for  use  in  the  hazard 
assessment of chemicals. We have also explored the mathematical nature of the molecular similarity
space  in  order  to  better  understand  the  basis  of  analog  selection  by  QMSA  methods.  The  parameter 
spaces  used  for  QMSA  and  analog  selection  were  constructed  from  nonempirical  parameters  derived 
from  computational  chemical  graph  theory.  Occasionally,  graph  invariants  were  supplemented  with 
geometrical  parameters  and  quantum  chemical  indices  to  study  the  relative  effectiveness  of  graph 
invariants  vis-a-vis  geometrical  and  quantum  chemical  parameters  in  analog  selection  and  property 
estimation.  We  carried  out  comparative  studies  of  nonempirical  descriptor  spaces  and  physicochemical 
property  spaces  in  selecting  analogs.  Molecular  similarity  methods  were  applied  in  predicting  modes  of 
toxic  action  (MOA)  of  chemicals.  Our  similarity/dissimilarity  methods  have  also  found  successful 
applications  in  the  discovery  of  new  drug  leads  by  US  drug  companies. 

In  this  project,  we  will  have  four  primary  goals:  1)  development  of  a  hierarchical  approach  to  molecular 
similarity,  2)  formulation  of  quantitative  structure-activity  relationship  (QSAR)  models  for  predictive 
toxicology  using  a  hierarchical  approach,  3)  applications  of  hierarchical  QSAR  and  QMSA  approaches  in 
computational  toxicology  related  to  human  health  and  ecological  hazard  assessment,  and  4)  the 
application  of  hierarchical  QMSA  and  QSAR  approaches  in  estimating  potential  toxicity  of  deicing  agents. 

The  first  goal  of  the  project  is  the  use  of  parameters  of  gradually  increasing  complexity,  viz.,  topological, 
topochemical,  geometrical,  and  quantum  chemical  indices,  in  the  quantification  of  molecular 
similarity/dissimilarity  of  chemicals.  We  will  take  a  two-tier  approach  in  this  area.  First,  similarity  methods 
will  be  used  in  ordering  sets  of  molecules  and  in  selecting  structural  analogs  of  toxic  chemicals  which 
pose  human  health  and  ecological  hazards.  Secondly,  we  will  use  the  properties  of  selected  analogs  in 
estimating  toxicologically  important  properties  for  chemicals.  Although  different  classes  of  parameters 
have  been  used  in  the  characterization  of  molecular  similarity,  no  systematic  study  has  been  carried  out 
in  the  use  of  all  four  classes  of  parameters,  mentioned  above,  in  analog  selection  and  property 
estimation.  We  will  apply  a  hierarchical  approach  to  the  use  of  these  four  types  of  theoretical  molecular 
descriptors  in  the  quantification  of  molecular  similarity/dissimilarity. 
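The two-tier idea described above (select analogs by similarity, then estimate a property from those analogs) can be sketched as follows. This is only a minimal illustration: the descriptor vectors, property values, Euclidean distance and unweighted averaging are all assumptions made for the example, not the project's actual parameter spaces or estimation procedure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_analogs(query, library, k):
    """Tier 1: return the k library chemicals closest to the query in descriptor space."""
    ranked = sorted(library, key=lambda item: euclidean(query, item["descriptors"]))
    return ranked[:k]

def estimate_property(query, library, k):
    """Tier 2: estimate the query's property as the mean over its k selected analogs."""
    analogs = select_analogs(query, library, k)
    return sum(a["property"] for a in analogs) / len(analogs)

# Hypothetical library: names, descriptors and property values are invented.
library = [
    {"name": "A", "descriptors": [1.0, 0.0], "property": 2.0},
    {"name": "B", "descriptors": [1.2, 0.1], "property": 2.4},
    {"name": "C", "descriptors": [5.0, 4.0], "property": 9.0},
    {"name": "D", "descriptors": [0.9, -0.1], "property": 1.9},
]
query = [1.0, 0.05]

analogs = select_analogs(query, library, k=2)
estimate = estimate_property(query, library, k=2)
print([a["name"] for a in analogs], round(estimate, 3))
```

In a hierarchical study the same two functions would be rerun with descriptor vectors built from each class of parameters in turn, so that the quality of the selected analogs can be compared across classes.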

The  second  goal  consists  of  the  development  of  hierarchical  QSAR  models  for  predicting  the  toxic 
potential  of  chemicals  using  topological  and  quantum  chemical  indices.  Initially,  we  will  use  parameters 
calculated  by  semi-empirical  methods  such  as  MOPAC  and  AMPAC.  Parameters  calculated  by  ab  initio 
quantum  chemical  methods  will  be  used  in  limited  cases  of  QSAR  model  development,  if  they  are 
considered  necessary. 

The  third  goal  of  the  project  will  be  the  prediction  of  human  health  hazard  and  ecotoxicological  effects  of 
chemicals  using  QSAR  and  QMSA  methods  developed  in  the  project.  Attempts  will  be  made  to  estimate 
endpoints,  such  as,  carcinogenicity,  mutagenicity,  xenoestrogenicity,  acute  toxicity,  transport  of 
chemicals  through  the  blood-brain  barrier,  biodegradation,  and  bioconcentration  factor. 

The  fourth  goal  will  involve  the  utilization  of  QMSA  and  QSAR  methods  developed  as  part  of  this  project 
in  predicting  the  potential  toxicity  of  deicing  agents. 

Status of Efforts


During  the  first  year  of  the  project  the  majority  of  effort  was  spent  in  the  development  of  novel 
hierarchical  QSAR  methods,  QMSA  techniques  and  the  applications  of  these  methods  in  the  prediction 
of  toxicological,  physicochemical  and  biomedicinal  properties  of  different  sets  of  chemicals.  Our 
dissimilarity  methods  were  used  to  group  JP-8  constituents  into  a  small  number  of  clusters  that  can  be 
used  in  selecting  surrogate  mixtures  for  JP-8  in  the  Air  Force’s  toxicological  studies.  The  clustering  was 
done  using  algorithmically  derived  molecular  descriptors  calculated  by  our  computer  program  POLLY. 
Such  parameters  can  be  calculated  for  any  molecular  structure,  real  or  hypothetical.  This  makes  the 
clustering  methods  independent  of  any  experimentally  determined  property  of  the  JP-8  constituents. 

During  the  second  year  of  the  project,  our  effort  was  directed  towards  the  development  of  novel  optimal 
molecular  descriptors,  the  development  and  use  of  new  topological  indices,  the  study  of  the 
intercorrelation  of  a  large  number  of  molecular  descriptors,  and  the  use  of  calculated  molecular 
descriptors  in  the  prediction  of  toxicological  and  toxicologically-relevant  properties.  We  also  explored  the 
possibility of developing integrated QSAR (I-QSAR) with the combination of chemodescriptors derived
from  computational  chemistry  and  biodescriptors  derived  from  biological  techniques  such  as  proteomics. 

The  third  year  of  the  project  has  focused  on  the  further  expansion  of  our  theoretical  molecular  descriptor 
set  through  the  further  development  of  new  topological  indices  and  the  acquisition  of  several  other  well- 
known  software  packages  for  the  calculation  of  molecular  descriptors,  viz.,  CODESSA  v2.0  and 
Molconn-Z  v3.50.  Along  with  this  expansion,  we  have  continued  our  pioneering  studies  in  the 
intercorrelation  of  large  molecular  descriptor  sets  and  the  use  of  this  expanded  descriptor  set  in  the 
prediction  of  toxicological  and  toxicologically-relevant  properties.  We  have  also  begun  the  initial 
exploration  of  the  creation  of  biodescriptors,  derived  from  matrix  invariants,  to  handle  data  from 
proteomics  maps  and  have  developed  several  new  methods  for  the  characterization  of  DNA  sequences. 


Accomplishments/  New  Findings 

The  following  is  the  summary  of  accomplishments  of  the  various  tasks  of  the  project  during  the  reporting 
period: 


Task  1:  Development  of  Databases 

Years  1  &  2  Databases  of  toxicological  endpoints  and  physicochemical  properties  have  been 
developed  from  published  literature.  Such  data  have  been  used  in  the  hierarchical  QSAR  and 
QMSA  studies  (vide  infra). 

Year 3 Efforts to develop more databases from published literature have tapered off, with
more emphasis being placed on other aspects of the project. However, we have been making
efforts to acquire a number of large, proprietary databases from various companies for the
purposes of testing some of our methods against “real” drug-development databases.


Task 2: Development of a Comprehensive Computer Program for Calculating
Topological Molecular Descriptors

Years  1  &  2  POLLY  can  calculate  more  than  one  hundred  topological  indices  (TIs).  We  have 
been  working  to  develop  algorithms  to  calculate  other  topological  descriptors  such  as  local 
invariants.  Such  indices  will  be  tested  in  hierarchical  QSAR  and  QMSA  research. 

Year 3 A new software module associated with POLLY has been developed and is
currently being tested. This module, called TRIPLET, can calculate 100 local vertex invariants
(LOVIs), which are also known as triplet indices.
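To make the notion of a topological index concrete, the sketch below computes two classical indices, the Wiener index and the Randic connectivity index, from a hydrogen-suppressed molecular graph. It is an illustration only: it is not taken from POLLY or TRIPLET, and the adjacency-list representation and function names are assumptions made for the example.

```python
import math
from collections import deque

def wiener_index(adj):
    """Sum of shortest-path (topological) distances over all unordered vertex pairs."""
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this atom
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

def randic_index(adj):
    """Sum over bonds of 1 / sqrt(deg(u) * deg(w))."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    total = 0.0
    for u in adj:
        for w in adj[u]:
            if u < w:  # visit each bond once
                total += 1.0 / math.sqrt(deg[u] * deg[w])
    return total

# Carbon skeletons as adjacency lists (hydrogens suppressed).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

for name, graph in (("n-butane", n_butane), ("isobutane", isobutane)):
    print(name, wiener_index(graph), round(randic_index(graph), 4))
```

Both indices depend only on connectivity, which is why such descriptors can be calculated for any structure, real or hypothetical.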


Task  3:  Integration  of  Graph  Theory  and  Quantum  Chemistry  for  QSAR 

Years 1 & 2 Ongoing research in this area focused on the use of weighted graphs and
pseudographs in the development of novel descriptors. This will lead to novel invariants that can
encode  information  not  quantified  by  existing  molecular  descriptors.  In  the  second  year  of  the 
project,  a  paper  was  submitted  for  publication  that  studied  the  interrelationship  of  over  200 
topological  indices. 

Year 3 The intercorrelation study submitted last year was published this spring in the
Journal of Chemical Information and Computer Sciences (Basak et al. 2000). This study is being
followed by a more rigorous study using a larger set of 318 indices on an expanded
set of databases. Additionally, our finding that in many cases quantum chemical indices do no
better than topological indices in QSAR modeling is being borne out by the work of other
researchers.
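An intercorrelation study of this kind starts from pairwise correlations between descriptor columns. The sketch below flags strongly correlated pairs using the Pearson coefficient. The index names, values and the 0.95 threshold are hypothetical, and the published study may well have used other statistics (for example principal components or variable clustering).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(table, threshold):
    """List (name_a, name_b, r) for every descriptor pair with |r| >= threshold."""
    names = sorted(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical descriptor table: one list of values per index, one entry per compound.
table = {
    "idx_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "idx_b": [2.1, 4.0, 6.2, 7.9, 10.0],  # roughly twice idx_a, so nearly redundant
    "idx_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}

pairs = correlated_pairs(table, threshold=0.95)
for a, b, r in pairs:
    print(a, b, round(r, 4))
```

Pairs flagged this way carry largely the same structural information, so only one member of each pair usually needs to enter a model.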


Task  6:  Characterization  of  Structure  Using  Theoretical  Structural  Descriptors 

Years  1  &  2  We  have  used  topological  indices  and  principal  components  (PCs)  derived  from 
them  in  the  characterization  of  a  set  of  isospectral  graphs  which  cannot  be  discriminated  by  the 
eigenvalues of the adjacency matrix of molecular graphs. This result was published in the Journal
of  Chemical  Information  and  Computer  Sciences  (Balasubramanian  and  Basak  1998). 

Attempts  have  been  made  to  devise  descriptors  that  characterize  chemical  structures 
optimally.  This  has  been  done  through  the  use  of  weighted  graphs.  Invariants  based  on  line 
graphs  have  also  been  used  for  QSAR  studies.  Both  of  these  techniques  involve  the  development 
of  novel  descriptors  for  the  characterization  of  molecular  structure. 

Year 3 Work on optimized molecular descriptors with Dr. Randic has continued, resulting
in a number of new publications. Additionally, this work has spread into new fields with our
development of methods to characterize protein structure and folding through the use of novel
invariants.


Task  7:  Development  of  Hierarchical  QMSA  Models 

Years  1  &  2  Topostructural,  topochemical,  geometrical  as  well  as  quantum  chemical 
parameters  have  been  used  in  the  development  of  QMSA  methods.  We  carried  out  a 
dissimilarity-based  clustering  of  JP-8  constituents  into  fourteen  clusters.  A  mixture  of  compounds 
selected  from  each  cluster  can  be  used  as  surrogates  for  the  complex  JP-8  mixture. 
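The report does not state which clustering algorithm was used, so the following is only a sketch of one simple dissimilarity-based scheme: seeds are chosen by max-min (farthest-first) selection, every compound is assigned to its nearest seed, and the member closest to each cluster centroid is taken as a candidate surrogate. The two-dimensional descriptor points are invented for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_seeds(points, k):
    """Pick k mutually dissimilar seed indices by max-min selection, starting at index 0."""
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(points)) if i not in seeds]
        best = max(candidates, key=lambda i: min(dist(points[i], points[s]) for s in seeds))
        seeds.append(best)
    return seeds

def cluster(points, k):
    """Assign each point to its nearest seed; returns {seed_index: [member indices]}."""
    seeds = farthest_first_seeds(points, k)
    clusters = {s: [] for s in seeds}
    for i, p in enumerate(points):
        nearest = min(seeds, key=lambda s: dist(p, points[s]))
        clusters[nearest].append(i)
    return clusters

def representative(points, members):
    """Return the member closest to the centroid of its cluster."""
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
    return min(members, key=lambda i: dist(points[i], centroid))

# Hypothetical descriptor vectors for eight compounds forming three loose groups.
points = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9],
    [9.0, 0.0], [9.1, 0.2], [8.8, 0.1],
]

clusters = cluster(points, k=3)
groups = sorted(sorted(members) for members in clusters.values())
surrogates = [representative(points, members) for members in groups]
print(groups, surrogates)
```

Applied to real descriptors, the number of clusters would be a modeling choice (fourteen in the JP-8 study), and the representatives form the candidate surrogate mixture.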

The  method  has  also  been  used  in  the  clustering  of  a  large,  virtual,  combinatorial  library  of 
Psoralen  derivatives.  The  results  of  this  analysis  were  presented  in  five  papers  at  the 
International  Biophysics  Congress,  New  Delhi,  September  19-23,  1999. 

Year 3 Additional studies involving the development and refinement of the hierarchical
QMSA method were presented at the Second Indo-US Workshop on Mathematical Chemistry,
Duluth, MN, May 30-June 3, 2000 and at the National American Chemical Society meeting,
Washington, D.C., August 20-24, 2000.

Task 8: Development of Hierarchical Approach to QSAR

Years 1 & 2 Quantum chemical parameters calculated by semiempirical methods have been
used in hierarchical QSAR models for predicting toxicity and toxicologically relevant
physicochemical properties. Several manuscripts have been published in peer-reviewed journals.

Our  hierarchical  approach  has  been  used  in  the  development  of  QSAR  models  for  the 
prediction of toxicity (e.g., aquatic toxicity, LC50, of a set of benzene derivatives, skin penetration
by polycyclic aromatic hydrocarbons, mutagenicity, etc.). We have used mainly linear statistical
methods such as variable clustering, principal components analysis, etc., for model building. In the
area  of  neural  net  analysis,  we  used  linear  as  well  as  nonlinear  methodology.  In  the  case  of 
toxicity  of  benzene  derivatives,  there  were  some  improvements  in  the  model  over  the  linear 
statistical  methods  by  the  applications  of  neural  net  methodology. 
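The hierarchical idea can be sketched as a sequence of nested linear models in which each tier adds the next, more expensive class of descriptors. Everything in the example is an assumption for illustration: the data are invented, each class is reduced to a single descriptor, and plain ordinary least squares stands in for the variable clustering and principal components steps mentioned above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_r2(columns, y):
    """Ordinary least squares with an intercept; returns the training R^2."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: one descriptor per class for eight compounds.
tiers = [
    ("topostructural", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
    ("topochemical", [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4]),
    ("geometrical", [2.0, 1.0, 4.0, 3.0, 1.5, 3.5, 2.5, 0.5]),
    ("quantum chemical", [-0.3, 0.4, 0.1, -0.2, 0.5, -0.4, 0.2, 0.0]),
]
activity = [1.2, 1.9, 3.6, 4.1, 5.4, 5.8, 7.5, 7.9]

r2_by_tier = []
used = []
for name, column in tiers:
    used.append(column)  # each tier keeps all descriptors from the earlier tiers
    r2_by_tier.append(fit_r2(used, activity))
    print("up to", name, round(r2_by_tier[-1], 4))
```

Because the models are nested, the training R^2 cannot decrease from one tier to the next, so judging whether a costlier descriptor class is worth adding would normally rely on a cross-validated statistic rather than on the fit alone.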

Year 3 Findings of recent hierarchical QSAR modeling studies were presented at both the
Second Indo-US Workshop on Mathematical Chemistry and the National American Chemical
Society meeting. We have continued working to examine the relative effectiveness of linear and
non-linear statistical methods versus linear and non-linear neural network methods; this work has
resulted in the publication of two manuscripts and the submission of two other studies for peer
review and publication.

Work  on  the  development  of  novel  biodescriptors  has  been  progressing  well.  Our 
collaborative  efforts  aim  at  the  development  of  a  series  of  novel  invariants  for  the  characterization 
of  proteomics  maps.  We  hope  to  continue  these  studies  to  move  beyond  the  theoretical  stage  to 
develop  software  to  calculate  these  invariants  and  to  test  them  in  QSAR  model  development. 

Correlation between structure and normal boiling point of acyclic carbonyl compounds, A. T. Balaban, D.
Mills  and  S.  C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  39,  758-764, 1999. 

Hazard assessment modeling: An evolutionary ensemble approach, D.W. Opitz, S.C. Basak and B.D.
Gute, In: Genetic and Evolutionary Computation, Eds. W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V.
Honavar, M. Jakiela, & R.E. Smith, Morgan Kaufmann: San Francisco, 1999, p 1643-1651.

Information  theoretic  indices  of  neighborhood  complexity  and  their  applications,  S.C.  Basak,  In  Topological 
Indices  and  Related  Descriptors  in  QSAR  and  QSPR,  Eds.  J.  Devillers  and  A.T.  Balaban,  Gordon  and 
Breach  Science  Publishers,  Amsterdam,  1999,  p  563-593. 

Normal boiling points of 1,ω-alkanedinitriles: The highest increment in a homologous series, A.T. Balaban,
S.C.  Basak  and  D.  Mills,  J.  Chem.  Inf.  Comput.  Sci.,  39,  769-774,  1999. 

Optimal  molecular  descriptors  based  on  weighted  path  numbers,  M.  Randic  and  S.  C.  Basak,  J.  Chem.  Inf. 
Comput.  Sci.,  39,  261-266,  1999. 

Prediction  of  complement-inhibitory  activity  of  benzamidines  using  topological  and  geometric  parameters, 
S.C.  Basak,  B.D.  Gute,  and  S.  Ghatak,  J.  Chem.  Inf.  Comput.  Sci.,  39,  255-260,  1999. 

Prediction  of  the  dermal  penetration  of  polycyclic  aromatic  hydrocarbons  (PAHs):  a  hierarchical  QSAR 
approach, B. D. Gute, G. D. Grunwald, and S. C. Basak, SAR QSAR Environ. Res., 10, 1-15, 1999.

Use of statistical and neural net methods in predicting toxicity of chemicals: A hierarchical QSAR approach,
S.C. Basak, B.D. Gute, G.D. Grunwald, D.W. Opitz and K. Balasubramanian, In Predictive Toxicology of
Chemicals: Experiences and Impact of AI Tools - Papers from the 1999 AAAI Symposium, March 22-24,
1999, Stanford, CA, TR SS-99-01, AAAI Press: Menlo Park, CA, 1999, p 108-111.

2000

A comparative QSAR study of benzamidines complement-inhibitory activity and benzene derivatives acute
toxicity, S.C. Basak, B.D. Gute, B. Lucic, S. Nikolic and N. Trinajstic, Computers & Chemistry, 24, 181-191,
2000.

Construction  of  high-quality  structure-property-activity  regressions:  The  boiling  points  of  sulfides,  M.  Randic 
and  S.  C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  40,  899-905,  2000. 

Multiple  regression  analysis  with  optimal  molecular  descriptors,  M.  Randic  and  S.C.  Basak,  SAR  QSAR 
Environ.  Res.,  11, 1-23,  2000. 

On  3-D  graphical  representation  of  DNA  primary  sequences  and  their  numerical  characterization,  M. 
Randic,  M.  Vracko,  A.  Nandy  and  S.  C.  Basak,  J.  Comput.  Chem.,  40, 1235-1244,  2000. 

QSPR  modeling:  Graph  connectivity  indices  versus  line  graph  connectivity  indices,  S.  C.  Basak,  S.  Nikolic, 
N.  Trinajstic,  D.  Amic  and  D.  Beslo,  J.  Chem.  Inf.  Comput.  Sci.,  40,  927-933,  2000. 

Simple  numerical  descriptor  for  quantifying  effect  of  toxic  substances  on  DNA  sequences,  A.  Nandy  and  S. 
C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  40,  915-919,  2000. 

Topological  indices:  Their  nature  and  mutual  relatedness,  S.  C.  Basak,  A.  T.  Balaban,  G.  D.  Grunwald  and 
B.  D.  Gute,  J.  Chem.  Inf.  Comput.  Sci.,  40,  891-898,  2000. 

Use  of  graph  invariants  in  QMSA  and  predictive  toxicology,  S.C.  Basak  and  B.D.  Gute,  In  Discrete 
Mathematical  Chemistry,  Eds.  P.  Hansen,  P.  Fowler,  M.  Zheng,  DIMACS  Series  51,  American 
Mathematical  Society:  Providence,  Rhode  Island,  2000,  pages  9-24. 

Use  of  statistical  and  neural  net  approaches  in  predicting  toxicity  of  chemicals,  S.  C.  Basak,  G.  D. 
Grunwald,  B.  D.  Gute,  K.  Balasubramanian  and  D.  Opitz,  J.  Chem.  Inf.  Comput.  Sci.,  40,  885-890,  2000. 


In press

Molecular  similarity  based  estimation  of  properties:  A  comparison  of  structure  spaces  and  property  spaces, 
B.D.  Gute,  G.D.  Grunwald,  D.  Mills  and  S.C.  Basak,  SAR  QSAR  Environ.  Res.,  2000. 

On characterization of physical properties of amino acids, M. Randic, D. Mills and S. C. Basak, Int. J.
Quant. Chem., 2000.

On  ordering  of  folded  structures,  M.  Randic,  M.  Vracko,  M.  Novic  and  S.  C.  Basak,  Mathematical 
Chemistry,  MATCH,  2000. 

Quantitative  comparison  of  five  molecular  structure  spaces  in  selecting  analogs  of  chemicals,  S.C.  Basak, 
B.D.  Gute,  and  G.D.  Grunwald,  Mathl.  Model.  Comput.  Sci.,  2000. 

Reverse  Wiener  index,  A.  T.  Balaban,  D.  Mills  and  S.  C.  Basak,  Croat.  Chim.  Acta,  2000. 

Use of mathematical structural invariants in analysing combinatorial libraries: A case study with Psoralen
derivatives, S.C. Basak, D. Mills, B.D. Gute, A.T. Balaban, K. Basak and G.D. Grunwald, In Some Aspects
of  Mathematical  Chemistry,  Eds.  D.K.  Sinha,  S.C.  Basak,  R.K.  Mohanty  and  I.N.  Basumallick,  Visva-Bharati 
University:  Santiniketan,  West  Bengal,  India,  2000. 

Variable  molecular  descriptors,  M.  Randic  and  S.C.  Basak,  In  Some  Aspects  of  Mathematical  Chemistry, 
Eds.  D.K.  Sinha,  S.C.  Basak,  R.K.  Mohanty  and  I.N.  Basumallick,  Visva-Bharati  University:  Santiniketan, 
West  Bengal,  India,  2000. 

Accepted

Modelling  the  solubility  of  aliphatic  alcohols  in  water.  Graph  connectivity  indices  versus  line  graph 
connectivity  indices,  S.  Nikolic,  N.  Trinajstic,  D.  Amic,  D.  Beslo  and  S.  C.  Basak,  In  QSAR/QSPR  Studies 
by  Molecular  Descriptors,  M.  V.  Diudea,  Ed.,  Nova  Science  Publishers,  New  York,  USA,  2000. 

Submitted 

A  neural  net-based  QSAR  algorithm  (PCANN)  and  its  comparison  with  hologram-  and  multiple  linear 
regression-based  QSAR  approaches  applied  to  1,4-dihydropyridine-based  calcium  channel  antagonists, 
V.N.  Viswanadhan,  G.A.  Mueller,  S.C.  Basak  and  J.N.  Weinstein,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

A  new  descriptor  for  structure-property  and  structure-activity  correlations,  M.  Randic  and  S.C.  Basak,  J. 
Chem.  Inf.  Comput.  Sci.,  2000. 

A  novel  2-D  graphical  representation  of  DNA  sequences  of  low  degeneracy,  X.  Guo,  M.  Randic  and  S.C. 
Basak,  Chem.  Phys.  Lett.,  2000. 

Characterization  of  DNA  primary  sequences  based  on  the  average  distances  between  bases,  M.  Randic 
and  S.  C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

Distance  indices  and  their  hyper-counterparts:  Intercorrelation  and  use  in  the  structure-property  modeling, 
N.  Trinajstic,  S.  Nikolic,  S.C.  Basak  and  I.  Lukovits,  SAR  QSAR  Environ.  Res.,  2000. 

On  structural  interpretation  of  distance  related  topological  indices,  M.  Randic,  A.T.  Balaban  and  S.C. 
Basak,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

On  the  characterization  of  DNA  primary  sequences  by  triplet  of  nucleic  acid  bases,  M.  Randic,  X.  Guo  and 
S.C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

On use of the variable connectivity index Y in QSAR: Toxicity of aliphatic ethers, M. Randic and S. C.
Basak,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

Prediction  of  mutagenicity  of  aromatic  and  heteroaromatic  amines  from  structure:  A  hierarchical  QSAR 
approach,  S.C.  Basak,  D.R.  Mills  and  A.T.  Balaban,  J.  Chem.  Inf.  Comput.  Sci.,  2000. 

QSAR  with  few  compounds  and  many  features,  D.M.  Hawkins,  S.  C.  Basak  and  X.  Shi,  J.  Chem.  Inf. 
Comput.  Sci.,  2000. 

Copies  of  manuscripts  published  since  the  1999  year-end  report  are  attached  as  Appendix  1.  Copies  of  the 
manuscripts  at  various  levels  of  review  and  publication  have  been  omitted  for  the  sake  of  brevity. 


Interactions/  Transitions 

Transitions 

1. Applied computational methods in the design of a set of six anti-epileptic carbamates by Professor
Alexandru  T.  Balaban,  Vice  President,  Rumanian  Academy  of  Sciences. 

2.  Worked  with  Dr.  James  Riviere,  North  Carolina  State  University,  in  the  clustering  of  JP-8  components 
using  dissimilarity  methods  developed  at  NRRI. 

3. Worked with Dr. Alexander Gybin, The Chromaline Corporation, Duluth, MN, in the computer-assisted
design of photoactive chemicals.

4. Applied computational methods in the design of a set of novel photoactive chemicals by Professor
Alexandru T. Balaban, Vice President, Rumanian Academy of Sciences (with Dr. Alexander Gybin,
Chromaline Corporation, Duluth, MN).

5. Worked with Dr. Frank Witzmann, IUPUI, in the development of integrated QSAR methods using
chemodescriptors  and  biodescriptors. 

6. Worked with Dr. Hirak Basu, SLIL Technology, Madison, WI, to generate a virtual library of about
80,000  chemicals  to  carry  out  dissimilarity  based  design  of  novel  anticancer  drugs  using  POLLY 
parameters. 

7.  Worked  with  Dr.  Marjan  Vracko,  National  Institute  of  Chemistry,  Ljubljana,  Slovenia,  to  apply  our 
hierarchical  QSAR  approach  to  predict  the  toxicity  of  chemicals  of  interest  to  the  European  community. 

8.  Currently  working  on  a  long-term  collaborative  project  with  Dr.  Indira  Ghosh,  Astra/Zeneca,  Bangalore, 
India,  to  implement  and  use  topological  indices  for  clustering  and  analysis  of  their  large,  proprietary 
databases  for  the  discovery  of  novel  lead  compounds. 


Meetings/  Seminars/  Invited  Presentations 

1. Dr. S.C. Basak was the Co-Chairperson of the First Indo/US Workshop on Mathematical Chemistry,
organized jointly by NRRI and Visva Bharati University, Santiniketan, West Bengal, India, Jan 9-13,
1998. Basak presented the following papers at the workshop:

i.  Graph  invariants,  molecular  similarity  and  QSAR  co-authored  by  B.D.  Gute  and  G.D.  Grunwald. 

ii.  Weighted  paths  as  novel  optimal  molecular  descriptors  authored  jointly  by  M.  Randic  and  Basak. 

iii.  The  utility  of  hierarchical  model  development  in  examining  the  structural  basis  of  properties 
authored  by  B.D.  Gute,  G.D.  Grunwald  and  Basak. 

iv.  Weighted  K-nearest  neighbors  property  estimation  in  molecular  similarity  authored  by  G.  D. 
Grunwald,  B.D.  Gute  and  Basak. 

v.  Dissimilarity  based  clustering  of  psoralen  derivatives  in  the  topological  structure  space:  A  strategy 
for  drug  design  authored  by  Basak,  G.D.  Grunwald,  D.  Panja,  K.  Basak  and  B.D.  Gute. 

2.  Dr.  S.C.  Basak  gave  several  invited  lectures  at  various  national  and  international  symposia  during  his 
stay  in  India  from  December  23, 1997  through  January  31, 1998.  These  lectures  included: 

i.  A  distinguished  lecture  Rational  drug  design  and  Ayurvedic  medicine  at  the  conference  organized 
by  the  Association  of  Ayurvedic  Doctors  of  India  (AADI),  January  4, 1998. 

ii. An invited lecture on Use of computational methods and Ayurvedic knowledge in modern drug
discovery  at  the  conference  AYURVEDA  TODAY,  January  8, 1998. 

iii.  An  invited  seminar  on  Assessment  of  genotoxicity  of  chemicals  from  structure:  A  computational 
approach  at  the  Annual  Conference  of  the  Indian  Association  for  Cancer  Congress,  Calcutta, 
January  21-24, 1998.  The  lecture  was  co-authored  by  B.D.  Gute  and  G.D.  Grunwald. 

3. Dr. S.C. Basak chaired a session at the DIMACS Workshop on Discrete Mathematical Chemistry,
March 23-25, 1998, held at Rutgers University, New Jersey. He also presented an invited paper
entitled  Use  of  graph  invariants  in  QSAR  and  predictive  toxicology  at  the  conference  authored  jointly  by 
Basak,  B.D.  Gute  and  G.D.  Grunwald. 

4.  Dr.  S.C.  Basak  gave  an  invited  presentation  entitled  A  computational  approach  to  predicting  toxicity: 
Possible  applications  to  JP8  jet  fuel  at  the  First  International  Conference  on  the  Environmental  Health 
and  Safety  of  Jet  Fuels,  organized  jointly  by  US  Air  Force,  National  Institute  of  Occupational  Safety  and 
Health,  USEPA  National  Exposure  Research  Laboratory  and  American  Industrial  Hygiene  Association, 
April  1-3,  1998,  San  Antonio,  TX. 

5.  Dr.  S.C.  Basak  presented  the  following  papers  at  the  International  Conference  Computational  Methods 
in  Toxicology  held  April  20-22, 1998,  Dayton,  OH: 

i.  Use  of  computational  methods  in  predicting  potential  toxicity  of  chemicals  authored  jointly  by 
Basak,  B.D.  Gute  and  G.D.  Grunwald. 

ii.  On  construction  of  optimal  molecular  descriptors  authored  jointly  by  M.  Randic  and  Basak. 

iii. Predicting mode of action of chemicals from structure: A hierarchical approach authored jointly by
Basak,  G.D.  Grunwald  and  B.D.  Gute. 

iv.  A  hierarchical  approach  to  predictive  toxicology  using  computed  molecular  descriptors  authored 
jointly by B.D. Gute, G.D. Grunwald and Basak.

6.  Dr.  S.C.  Basak  presented  a  paper  Dissimilarity-based  clustering  of  psoralen  derivatives  in  the 
topological  structure  space:  A  strategy  for  drug  design  at  the  Second  Annual  Chemoinformatics 
Workshop,  organized  by  the  Cambridge  Health  Institute,  Boston,  MA,  June  15-16,  1998.  The  paper 
was co-authored by G.D. Grunwald and B.D. Gute.

7. Dr. S.C. Basak presented an invited seminar Novel drug design methods: Assessing activity and
toxicity using computational chemistry at the Department of Molecular Biology and Genetics, University
of Guelph, Ontario, Canada, July 3, 1998.

8.  Dr.  S.C.  Basak  presented  the  invited  lecture  Use  of  theoretical  structural  descriptors  in  molecular 
design  and  hazard  assessment  of  chemicals  to  the  scientists  of  the  computer-aided  drug  design 
company  NANODESIGN,  INC,  Toronto,  Canada,  July  6, 1998. 

9.  Dr.  S.C.  Basak  attended  the  First  Environmental  Management  Science  Program  Workshop  organized 
jointly  by  the  American  Chemical  Society  and  the  Office  of  Environmental  Management,  Department  of 
Energy,  Chicago,  IL,  July  27-30, 1998. 

10.  Dr.  S.C.  Basak  presented  the  invited  lecture  Theoretical  molecular  descriptors  for  the  prediction  of 
bioactivity  /toxicity,  selection  of  analogs,  discovery  and  optimization  of  leads  authored  jointly  by  Basak, 
B.D.  Gute,  G.D.  Grunwald  and  A.T.  Balaban  at  the  Astra  Symposium  on  Advances  in  Medicinal 
Chemistry  organized  by  the  Astra  company,  Bangalore,  September  17-19, 1998. 

11.  Dr.  S.C.  Basak  presented  the  invited  lecture  Prediction  of  bioactivity  of  chemicals  from  structure:  A 
computational approach at the Indian Institute of Science, Bangalore, India, September 20, 1998.

12. Dr. S.C. Basak presented the invited lecture Integration of traditional Indian medicine and
chemoinformatics  for  rapid  drug  discovery  at  the  conference  organized  jointly  by  East  India 
Pharmaceutical  Company,  Calcutta,  October  12,  1998. 

13. B.D. Gute presented an invited talk A hierarchical QSAR approach to predicting carcinogenicity of
chemicals authored jointly by S.C. Basak, Gute and G.D. Grunwald, at the 19th Annual Society of
Environmental Toxicology and Chemistry meeting, Charlotte, North Carolina, November 15-19, 1998.

14.  Dr.  S.C.  Basak  presented  the  invited  lecture  Clustering  of  JP-8  constituents  into  structurally  dissimilar 
groups: A novel computational strategy for predictive toxicology authored jointly by Basak and G.D.
Grunwald,  at  the  Air  Force  Office  of  Scientific  Research  JP-8  Jet  Fuel  Toxicology  Workshop,  held  at 
the  University  of  Arizona,  Tucson,  AZ,  December  2-3, 1998. 

15. Dr. S.C. Basak presented the invited lecture on Novel drug discovery methods: Predicting
pharmacological  and  toxicological  properties  of  chemicals  using  computational  chemistry  at  the 
Meharry  Medical  College,  Nashville,  TN,  January  19,  1999. 

16.  Dr.  S.C.  Basak  delivered  the  first  distinguished  lecture  in  Mathematical  Chemistry  on  From  graph 
invariants  to  molecular  design:  25  years  after  the  connectivity  index  at  Visva  Bharati  University, 
Santiniketan,  West  Bengal,  India,  February  11, 1999. 

17.  Dr.  S.C.  Basak  presented  the  invited  seminar  Theoretical  molecular  descriptors  for  the  prediction  of 
bioactivity,  toxicity,  selection  of  analogs,  discovery  and  optimization  of  leads  at  the  Wockhardt 
Research  Centre,  Aurangabad,  Maharashtra,  India,  on  February  15,  1999. 

18. Dr. S.C. Basak presented the invited lecture Prediction of bioactivity of chemicals from structure: A
hierarchical  computational  approach  at  Bharatiya  Vidya  Bhavans  Swami  Prakashananda  Ayurvedic 
Research  Center,  Mumbai,  India,  on  February  18,  1999. 

19. Dr. S.C. Basak presented the invited lecture on Toxicology in silico: Addressing the quagmire of
environmental  pollution  and  protecting  public  health  using  computational  chemistry  authored  jointly  by 
Basak,  B.D.  Gute,  David  Opitz  and  G.D.  Grunwald  at  the  International  Symposia  Series:  Reducing  the 
Environmental  Impacts  of  Toxic  Chemicals  in  Asian  Economies.  The  Impacts  of  Toxic  Chemicals  and 
Pollutants  on  Public  Health,  the  Ecology  and  the  Environment  of  the  Bengal  Basin  -  Bangladesh  and 
India,  Dhaka  Bangladesh,  on  March  1, 1999. 

20.  Dr.  S.C.  Basak  presented  the  invited  seminar  on  Novel  drug  discovery  methods:  Predicting 
pharmacological  and  toxicological  properties  of  chemicals  using  computational  chemistry  at  the  School 
of  Pharmacy,  Dhaka  University,  Dhaka,  Bangladesh  on  March  4, 1999. 

21.  Dr.  S.C.  Basak  presented  the  invited  talk  Computational  toxicology:  A  cost  effective  approach  for  the 
protection  of  human  and  environmental  health  at  the  International  Conference  at  Santiniketan,  India, 
March  7,  1999. 

22.  Dr.  S.C.  Basak  gave  the  invited  presentation  Estimation  of  DNA  damage  from  toxic  chemicals  by 
graphical  techniques  authored  jointly  by  A.  Nandy,  C.  Raychaudhury,  S.  Ghosh,  and  Basak  on  March 
8,  1999. 

23. Dr. S.C. Basak attended the International Conference Smarter Lead Optimization: Easing the
Bottleneck  organized  by  Cambridge  Health  Institute,  March  18-19,  1999,  San  Diego,  CA  and  gave  the 
following  presentations: 

i.  A  computational  approach  to  predicting  toxicity  and  toxic  modes  of  action  of  chemicals  from 
structure. 

ii.  Topological  indices  as  molecular  descriptors  for  lead  optimization  authored  jointly  by  A.T.  Balaban 
and  Basak. 

24.  Dr.  S.C.  Basak  attended  the  American  Association  of  Artificial  Intelligence  conference,  Predictive 
Toxicology of Chemicals: Experiences and Impact of AI Tools, Stanford University, March 22-24, 1999
to  present  the  following  lectures: 

i.  Use  of  statistical  and  neural  net  methods  in  predicting  toxicity  of  chemicals:  A  hierarchical  QSAR 
approach  authored  jointly  by  Basak,  G.D.  Grunwald,  B.D.  Gute,  K.  Balasubramanian  and  D.  Opitz. 

ii.  A  Graphical  Technique  for  Preliminary  Assessment  of  Effects  on  DNA  Sequences  from  Toxic 
Substances  authored  jointly  by  A.  Nandy,  C.  Raychaudhury  and  Basak. 

25.  Dr.  Basak  presented  the  following  papers  at  the  QSAR  Gordon  Conference,  July  25-30,  1999,  Tilton, 
New  Hampshire: 

i. A hierarchical QSAR approach for predicting property/activity of chemicals authored by Basak,
G.D. Grunwald, B.D. Gute, D. Mills, K. Balasubramanian and A.T. Balaban.

ii.  Topological  indices  as  molecular  descriptors  for  QSAR  authored  by  A.T.  Balaban  and  Basak. 

26.  On  a  trip  to  Europe  and  India  during  September  of  1999,  Dr.  S.C.  Basak  gave  the  following  invited 
presentations: 

i. A hierarchical QSAR approach for predicting property/activity of chemicals from structure at the Rugjer
Boskovic Institute, Zagreb, The Republic of Croatia.

ii.  Predicting  property/activity/toxicity  of  chemicals  from  structure:  A  hierarchical  QSAR  approach  at 
the  National  Institute  of  Chemistry,  Slovenia. 

iii. Prediction of activity/toxicity of chemicals from structure using graph invariants at the Visva Bharati
University,  Santiniketan,  West  Bengal,  India. 

iv.  Predicting  biomedicinal  and  toxicological  properties  of  chemicals  using  molecular  descriptors  at  the 
University  of  Delhi,  India. 

v.  The  utility  of  Ayurvedic  medicine  for  modern  drug  discovery:  An  exploratory  analysis  at  the 
conference  organized  by  the  East  India  Pharmaceutical  Company,  Calcutta. 

27.  During  his  trip  to  India  in  September  of  1999,  Subhash  Basak  also  attended  the  13th  International 
Biophysics  Congress,  New  Delhi,  and  presented  the  following  papers: 
i. Clustering of Psoralen derivatives using topological invariants: A strategy for molecular design
coauthored  by  G.D.  Grunwald,  A.T.  Balaban  and  K.  Basak. 

ii.  A  hierarchical  QSAR  approach  to  predicting  bioactivity  of  chemicals  using  theoretical  molecular 
descriptors  coauthored  by  B.D.  Gute,  D.  Mills,  G.D.  Grunwald,  D.  Opitz  and  K.  Balasubramanian. 

iii.  Modeling  the  solubility  of  aliphatic  alcohols  in  water,  graph  connectivity  indices  versus  line  graph 
connectivity  indices  coauthored  by  D.  Amic,  S.  Nikolic,  N.  Trinajstic  and  D.  Beslo. 

iv.  Design  of  high  quality  structure-property  regressions  coauthored  by  M.  Randic. 

v.  On  numerical  characterization  of  DNA  primary  sequences  coauthored  by  M.  Randic,  M.  Vracko  and 
A.  Nandy. 

28. Dr. Basak gave an invited presentation on Development of hierarchical QSAR models for predicting
toxicity  of  chemicals:  Statistical  and  neural  net  approaches  at  the  Air  Force  Predictive  Toxicology 
Conference,  Wright  Patterson  Air  Force  Base,  Dayton,  OH. 

29.  Subhash  Basak  gave  an  invited  presentation  Exploring  the  scientific  basis  of  Ayurvedic  medicine:  A 
computational  approach  at  the  conference  Beyond  Conventional  Healthcare:  Understanding  Alternative 
Choices  organized  by  the  University  of  Wisconsin,  Superior,  Nov.,  1999. 

30.  Dr.  Basak  participated  in  the  1999  Partners  in  Environmental  Technology  Symposium  and  Workshop 
held  in  Washington,  D.C. 

31.  Subhash  Basak  presented  the  invited  lecture  Applications  of  theoretical  molecular  descriptors  in  drug 
discovery  and  predictive  toxicology:  A  computational  approach  at  the  University  of  Montana,  Missoula. 

32. Dr. Basak gave the invited presentation Clustering of JP-8 chemicals using structure spaces and
property  spaces:  A  computational  approach  authored  jointly  by  B.D.  Gute,  G.D.  Grunwald,  D.  Mills,  J. 
Riviere  and  D.  Opitz  at  the  Air  Force  Office  of  Scientific  Research  JP-8  Jet  Fuel  Toxicology  Workshop, 
University  of  Arizona,  Tucson,  Jan.,  2000. 

33. Subhash Basak gave the following invited lectures/presentations during his trip to India, Feb., 2000:

i.  Predicting  biomedical  and  toxicological  properties  of  chemicals  using  molecular  descriptors:  A 
hierarchical  QSAR  approach  at  the  International  Conference  on  Medicinal  Chemistry  and 
Biocatalysis  organized  by  Delhi  University.  He  also  presented  the  following  four  posters  in  the 
same  conference: 

(a) Clustering of JP-8 chemicals using structure spaces and property spaces: A computational
approach authored jointly by Basak, B.D. Gute, G.D. Grunwald, D. Mills, J. Riviere and D.
Opitz.

(b)  Prediction  of  gas  chromatographic  retention  indices  using  variable  connectivity  index  authored 
jointly  by  M.  Randic,  Basak,  M.  Pompe  and  M.  Novic. 

(c)  Clustering  of  Psoralen  derivatives  using  topological  invariants:  A  strategy  for  molecular  design 
authored  jointly  by  Basak,  D.  Mills,  A.T.  Balaban,  K.  Basak  and  G.D.  Grunwald. 

(d)  A  novel  structure-activity  approach  to  benzamidines  complement  inhibitory  activity  authored 
jointly  by  Basak,  B.  Lucic,  S.  Nikolic  and  N.  Trinajstic. 

ii.  Basak  also  gave  the  invited  presentation  Applications  of  theoretical  molecular  descriptors  in  drug 
discovery  and  predictive  toxicology:  A  computational  approach  at  the  Ranbaxy  Research 
Laboratories, Udyog Vihar Industrial Area, Gurgaon, Haryana, India.

34.  D.  Mills  presented  the  paper  On  the  use  of  variable  connectivity  index  for  characterization  of  amino 
acids,  co-authored  by  Basak  and  M.  Randic,  at  the  40th  Sanibel  Symposium  on  Atomic,  Molecular, 
Biophysical  and  Condensed  Matter  Theory  organized  by  the  Quantum  Theory  Project,  at  the  University 
of  Florida,  March  2000. 

35.  Dr.  Basak  gave  the  presentation  Estimating  physicochemical  and  toxicological  properties  of  chemicals 
from  calculated  molecular  descriptors  co-authored  by  D.  Mills,  B.D.  Gute,  D.  Opitz  and  K. 
Balasubramanian at the Dept. of Energy’s Environmental Management Sciences Program National
Workshop  in  Atlanta,  April,  2000. 
36. Subhash Basak gave the lecture Predicting property/activity/toxicity of chemicals using calculated
molecular  descriptors  at  the  University  of  Florida,  Gainesville. 

37.  Dr.  Basak  and  co-workers  presented  the  following  papers  at  the  Second  Indo-US  Workshop  on 
Mathematical  Chemistry,  organized  by  NRRI  and  Visva  Bharati  University,  India: 

i. A.T. Balaban presented the poster On the clustering of Psoralens co-authored by Basak, D. Mills,
K. Basak, and G.D. Grunwald.

ii.  B.D.  Gute  presented  the  poster  Molecular  similarity-based  estimation  of  properties:  A  comparison 
of  structure  spaces  and  property  spaces  co-authored  by  G.D.  Grunwald,  D.  Mills  and  S.C.  Basak. 

iii.  Dr.  Basak  presented  the  invited  lecture  A  hierarchical  QSAR  approach  for  predicting 
property/activity/toxicity of chemicals using theoretical structural descriptors co-authored by B.D.
Gute,  D.  Mills,  A.T.  Balaban,  D.  Opitz  and  K.  Balasubramanian. 

iv. B.D. Gute presented the poster Clustering of chemicals using theoretical structure spaces: A case
study  with  476  diverse  chemicals  co-authored  by  Basak,  G.D.  Grunwald  and  D.  Mills. 

v. D. Mills presented the poster Clustering of JP-8 chemicals using property spaces and structure
spaces:  A  novel  tool  for  hazard  assessment  co-authored  by  Basak,  G.D.  Grunwald,  B.D.  Gute  and 
J.E.  Riviere. 

vi. M. Randic presented the poster On use of the variable connectivity index ¹χᶠ in QSAR: Toxicity of
aliphatic  ethers  co-authored  by  Basak. 

vii. A.T. Balaban presented the invited lecture Topological indices as valuable molecular descriptors
for QSAR and QSPR co-authored by O. Ivanciuc, D. Mills and Basak.

viii.  M.  Pompe  presented  the  poster  Prediction  of  gas  chromatographic  retention  indices  for  oxygenated 
compounds using variable connectivity index ¹χᶠ co-authored by M. Veber, M. Randic, M. Novic and
Basak. 

ix. A.T. Balaban presented the poster Topological indices: Their nature and mutual relatedness
co-authored by Basak, G.D. Grunwald and B.D. Gute.

38.  Dr.  Basak  and  collaborators  made  the  following  presentations  at  the  American  Chemical  Society 
Annual  meeting  recently  in  Washington,  D.C.: 

i.  A.T.  Balaban  presented  the  invited  lecture  Trends  and  possibilities  for  future  developments  of 
topological  indices  authored  jointly  by  Balaban  and  S.C.  Basak. 

ii.  B.D.  Gute  presented  the  invited  lecture  Use  of  graph  invariants  for  the  prediction  of 
property/activity/toxicity of chemicals authored jointly by S.C. Basak, Gute, D. Mills and A.T.
Balaban. 

iii.  Dr.  Basak  presented  the  lecture  Similarity-based  estimation  of  properties:  A  comparison  of 
structure  spaces  authored  jointly  by  B.D.  Gute,  G.D.  Grunwald,  D.  Mills  and  S.C.  Basak. 

iv. D. Mills presented the poster Clustering of JP-8 chemicals using structure spaces and property
spaces:  A  computational  approach  authored  jointly  by  Mills,  S.C.  Basak,  G.D.  Grunwald,  B.D.  Gute 
and  J.  Riviere. 

v.  D.  Mills  presented  the  poster  Hierarchical  clustering  of  Psoralen  derivatives  using  topological 
invariants:  A  strategy  for  molecular  design  authored  jointly  by  Mills,  S.C.  Basak,  B.D.  Gute,  A.T. 
Balaban,  G.D.  Grunwald  and  K.  Basak. 

vi.  D.  Mills  presented  the  poster  Use  of  variable  connectivity  indices  on  biological  molecules  authored 
jointly  by  Mills,  M.  Randic  and  S.C.  Basak. 

39.  Dr.  Basak  visited  Milan,  Italy  (early  September  2000)  to  discuss  collaborative  projects  with  colleagues 
at  the  Istituto  di  Ricerche  Farmacologiche  "Mario  Negri"  and  Milan  Chemometric  Research  Group, 
Department  of  Environmental  Sciences.  He  traveled  to  Slovenia  and  Croatia,  to  develop  and  discuss 
joint  quantitative  structure-activity/toxicity/property  relationship  (QSAR /  QSPR /  QSTR)  research  papers 
and  projects  with  colleagues  at  the  National  Institute  of  Chemistry,  Ljubljana,  Slovenia  and  the  Rugjer 
Boskovic  Institute. 
Honors and Awards

1. Dr. S.C. Basak was the Co-Chairperson of the First Indo-US Workshop on Mathematical Chemistry,
organized jointly by NRRI and Visva Bharati University, Santiniketan, West Bengal, India, Jan 9-13,
1998. 

2.  Dr.  S.C.  Basak  chaired  a  session  at  the  DIMACS  Workshop  on  Discrete  Mathematical  Chemistry, 
March  23-25, 1998,  held  at  Rutgers  University,  New  Jersey. 

3.  Dr.  Basak  organized  a  one-day  workshop  on  Applied  Mathematical  Chemistry:  Molecular  Descriptors 
and  Their  Applications  in  Structure-Property-Activity-Toxicity  Relationships,  May  3,  1999,  at  NRRI. 
Thirteen  speakers  from  seven  different  countries,  viz.,  Bulgaria,  Croatia,  India,  Romania,  Slovenia, 
United  Kingdom  and  United  States,  gave  invited  presentations  on  their  latest  research  on  Mathematical 
Chemistry,  Quantitative  Structure-Activity  Relationships  (QSAR),  Computational  Chemistry  and 
Predictive  Toxicology. 

4.  Dr.  Basak  has  been  invited  to  become  a  member  of  the  International  Advisory  Committee  of  the 
International  Symposium  Current  Trends  in  Drug  Discovery  Research,  February  11-15,  2001,  to  be 
organized  by  the  Central  Drug  Research  Institute  (CDRI),  Lucknow,  India,  the  premier  drug  discovery 
and  research  institute  of  the  country.  The  symposium  is  being  organized  to  celebrate  the  50th 
Anniversary  of  CDRI. 

5.  Basak  has  been  invited  to  become  a  member  of  the  Indian  National  Organizing  Committee  of  the 
International  Symposium  Strategies  and  Perspectives  in  Drug  Development,  Design  and  Molecular 
Modeling  to  be  organized  by  the  Indian  Institute  of  Chemical  Biology,  Calcutta,  Oct.  17-18,  2000. 

6.  Dr.  S.C.  Basak  was  the  Co-Chairperson  of  the  Second  Indo-US  Workshop  on  Mathematical  Chemistry 
with  Applications  to  Drug  Discovery,  Environmental  Toxicology,  Cheminformatics  and  Bioinformatics, 
held  in  Duluth,  MN  and  organized  jointly  by  NRRI  and  Visva  Bharati  University,  India,  May  30-June  3, 
2000. 


New  Discoveries/  Inventions,  Patent  Disclosures 

1. We found that constituents of complex mixtures like JP-8 can be clustered into different structural groups
using structure spaces derived from topological indices calculated by POLLY.

2.  An  in-depth  study  of  similarity  space  construction  and  analog  selection  resulted  in  the  discovery  that  for  a 
particular  set  of  compounds  the  degree  of  overlap  between  the  groups  of  analogs  selected  by  theoretical 
descriptor  spaces  is  relatively  high.  This  study  also  revealed  that  a  similarity  space  constructed  from 
physicochemical  property  data  provided  relatively  unique  sets  of  analogs  as  compared  to  those  selected 
from  the  theoretically-derived  similarity  spaces. 

3.  For  various  sets  of  toxicological  and  physicochemical  properties  the  topostructural  and  topochemical 
parameters  explain  most  of  the  variance  in  the  data;  the  addition  of  geometrical  and  quantum  chemical 
parameters to the set of independent variables yielded little or no improvement in the predictive power of
models. 
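
The analog-selection comparison described in item 2 can be illustrated with a minimal sketch. It assumes Euclidean distance with k nearest neighbors as the selection rule and the Jaccard index as the overlap measure; the chemical labels, coordinates, and function names are hypothetical and are not taken from this report.

```python
import math


def nearest_analogs(space, probe, k):
    """Return the k chemicals closest to `probe` by Euclidean distance."""
    target = space[probe]
    others = [name for name in space if name != probe]
    # Sort by distance, breaking ties alphabetically for reproducibility.
    others.sort(key=lambda name: (math.dist(space[name], target), name))
    return set(others[:k])


def overlap(a, b):
    """Jaccard index: shared analogs divided by all distinct analogs."""
    return len(a & b) / len(a | b)


# Hypothetical toy coordinates, for illustration only.
structure_space = {
    "A": (0.0, 0.0), "B": (0.1, 0.0), "C": (0.0, 0.2),
    "D": (1.0, 1.0), "E": (1.2, 0.9),
}
property_space = {
    "A": (0.0,), "B": (0.1,), "C": (2.0,),
    "D": (0.2,), "E": (3.0,),
}

analogs_structure = nearest_analogs(structure_space, "A", k=2)
analogs_property = nearest_analogs(property_space, "A", k=2)
shared = overlap(analogs_structure, analogs_property)
```

With these toy values the structure space selects B and C as analogs of A, the property space selects B and D, and the overlap is one third. A low overlap of this kind is what it would mean for a property-based space to provide relatively unique sets of analogs.
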
B.D. Gute, In: Genetic and Evolutionary Computation, Eds. W. Banzhaf, J. Daida, A.E. Eiben,
1643-1651.
Topological Indices and Related Descriptors in QSAR and QSPR, Eds. J. Devillers and A.T.
Balaban,  Gordon  and  Breach  Science  Publishers,  Amsterdam,  1999,  p  563-593. 

Normal boiling points of 1,ω-alkanedinitriles: The highest increment in a homologous series, A.T.
Balaban, S.C. Basak and D. Mills, J. Chem. Inf. Comput. Sci., 39, 769-774, 1999.

A comparative QSAR study of benzamidines complement-inhibitory activity and benzene
derivatives acute toxicity, S.C. Basak, B.D. Gute, B. Lucic, S. Nikolic and N. Trinajstic,
Computers & Chemistry, 24, 181-191, 2000.

Construction  of  high-quality  structure-property-activity  regressions:  The  boiling  points  of  sulfides, 
M.  Randic  and  S.  C.  Basak,  J.  Chem.  Inf.  Comput.  Sci.,  40,  899-905,  2000. 

Multiple  regression  analysis  with  optimal  molecular  descriptors,  M.  Randic  and  S.C.  Basak,  SAR 
QSAR  Environ.  Res.,  11, 1-23,  2000. 

On  3-D  graphical  representation  of  DNA  primary  sequences  and  their  numerical  characterization, 
M.  Randic,  M.  Vracko,  A.  Nandy  and  S.  C.  Basak,  J.  Comput.  Chem.,  40, 1235-1244,  2000. 

QSPR modeling: Graph connectivity indices versus line graph connectivity indices, S. C. Basak,
S. Nikolic, N. Trinajstic, D. Amic and D. Beslo, J. Chem. Inf. Comput. Sci., 40, 927-933, 2000.

Simple numerical descriptor for quantifying effect of toxic substances on DNA sequences, A.
Nandy and S. C. Basak, J. Chem. Inf. Comput. Sci., 40, 915-919, 2000.

Topological indices: Their nature and mutual relatedness, S. C. Basak, A. T. Balaban, G. D.
Grunwald and B. D. Gute, J. Chem. Inf. Comput. Sci., 40, 891-898, 2000.

Use  of  graph  invariants  in  QMSA  and  predictive  toxicology,  S.C.  Basak  and  B.D.  Gute,  In 
Discrete  Mathematical  Chemistry,  Eds.  P.  Hansen,  P.  Fowler,  M.  Zheng,  DIMACS  Series  51, 
American  Mathematical  Society:  Providence,  Rhode  Island,  2000,  p  9-24. 

Use of statistical and neural net approaches in predicting toxicity of chemicals, S. C. Basak, G.
D. Grunwald, B. D. Gute, K. Balasubramanian and D. Opitz, J. Chem. Inf. Comput. Sci., 40,
885-890, 2000.
Appendix 1.1 A hierarchical approach to the development of
QSAR models using topological, geometrical...


15.  A  HIERARCHICAL  APPROACH 



Table I. Symbols, definitions and classifications of topostructural, topochemical, geometrical and quantum chemical descriptors


^  o  <l> 

g  to  g 

<D  ■*-»  > 

n-j  <U 

a  “  S 
g.  u,  .y 

6  j2  •o 
•2  13  -S 
•-  E  75 

4-.  00  o 
o  -3 
.  ^  E 

n>  P  <D 
XJ 

Boo 

3  ,rH  Oh 

e  a  2 

« s| 

IS  g 

3  2 

E  o  o 

O  3 

0  _  d 

H  o  to 
X  o 

o  cx 
c  o 


in 

<D 

J-r 

2 

E 

co 

3 

CJ 

CO 

"5 

3 

4> 

3 

.E 

0 

<D 

Ut 

co 

X 

t-H 

x: 

3 

X) 

3 

co 

w 

0 

0 

CO 

<D 

O 

P 

O 

CO 

H 

W) 

3 

O 

3 

u. 

O 

l— < 
Oh 

X 

3 

*3 

3 

3 

0 

r- 

0 

w 

CO 

•  S 

CO 

4-1 

.2 

uT 

2 

0 

co 

£ 

0 

Oh 

X 

< 

O 

3 

CO 

V 

O 

O 

(U 

X 

!E 

CO 

<D 

XJ 

13 

00 

<D 

<L> 

E 

x 

2 

0 

0 

4. 

t- 

<D 

JL> 

x> 

3 

t) 

£ 

^  XJ 

5P  £ 

3  d 
co  X 

p  !> 
X  X 
<D 


.a  -a 
o  w 

3  XJ 


a 

E 

c 

.2 

•4— » 

i3 

Lh 

u 
O 
o 

o 
J=! 

-4-J 

3 
O 

X) 

(L) 

00  J3T 
cd  99 

x>  999 


3 

x 

<u 

o 

O 

t_ 

a 

co 

P 

P 

U 

< 

> 

(D 

H 


<u 

3  w 

(U  3 


(30 


-  co 

p 

p 

U 

P< 

< 

> 


3  g 

o 

■5  H 


C/3  o 

o 

t— 1  w 
(O  in 

to  O 
^3  £ 
o  x 

rl  4) 

1l 

^  <U 
3  £ 
-3  w 
^  O 

O  w 

00  <L> 

Lh  (D 

£  <L> 

13  to 

^  3 
.£  o 
o 

0?  O 
ir;  3 
XJ  <u 


x  o 

£  2 

^5  00 

<L>  3 

t  0 

O  ° 
°  O 
>,  X 

T=  g 

§ s 

a.  c 


rt  8 
J2  3 

’  H  C 

XI  O 
^  £ 
S  jj 

o  b 

1  U 

3  o 
■"  o 

3  y 
3  .3 


O 

& 

u, 

a 

■4—1 

3 

(D 

X 

3 

o 

Oh 

<D 

X 

3 

X 

<D 

E 

u> 

c2  w5 

^  VD 

O  x> 
3 
00 


C/3 


3 

O 


E 

c  js 

O  o 
c  o 
.2  & 

X 
1>  3 
co  cd 


o 

o 

CD 


O  C  7± 

w  D  * 

D  *>  SC 

c/3  *£  *  — 

rs  < 


[Figure: axis label "Estimated Normal Boiling Point"]




[Figure: axis label "Percent Dermal Penetration (Estimated)"]



[Figure: axis label "Experimental log10(Pvap)"]




688  Topological  Indices  and  Related  Descriptors  in  QSAR  and  QSPR  Hierarchical  QSAR  Models  689 




Table II  Classification results for 520 mutagens/non-mutagens from DFA

Model type                  Indices included                               % Mutagens   % Non-mutagens
                                                                           correct      correct

Topostructural              IC, M1, χ_C, ^6χ_PC, P_10                      76.2         57.3

Topostructural              χ^v, ^3χ^v_Ch, ^6χ^v_Ch, ^6χ^v_PC, J^X, J^B    74.6         63.1
+ topochemical

Topostructural              [illegible], [illegible], P_10, IC_5,          69.2         71.9
+ topochemical              [illegible], χ^v_Ch, χ^v_PC, [illegible],
+ fragments                 nitroso¹, mustard², sulf³, benz⁴

Topostructural              M1, P_10, IC_5, [illegible], ^3χ_Ch,           71.5         71.9
+ topochemical              ^6χ_PC, J^B, nitroso¹, mustard², sulf³,
+ fragments                 benz⁴, Vw
+ geometrical

¹ Nitroso compounds. ² Halogenated substituted mustard, sulfur mustard or oxygen mustard. ³ Organic sulfates
or sulfonates. ⁴ Biphenyl amine, benzidine or 4,4'-methylenedianiline derivatives.
ASSESSMENT  OF  THE  MUTAGENICITY
OF  AROMATIC  AMINES  FROM  THEORETICAL
A hierarchical approach has been used in this paper to predict the
mutagenicity/non-mutagenicity of a set of 127 chemicals from their molecular
descriptors. The set of descriptors consisted of topostructural and
topochemical parameters, experimental properties like log P, and quantum
chemical indices calculated using a semi-empirical method. The results show
that a combination of topostructural and topochemical molecular descriptors
explains most of the variance in the experimental data. The addition of
physical properties or quantum chemical parameters did not significantly
improve the predictive power of the models.


Keywords: Aromatic amines; hierarchical similarity; mutagenicity; quantum chemical
descriptors; topological indices


INTRODUCTION 

A current interest in the fields of chemistry, toxicology and biomedical
sciences is the prediction of the property/activity of chemicals from
calculated molecular descriptors [1-6]. In both environmental hazard
assessment and pharmaceutical drug design, one has to deal with thousands,
sometimes millions, of real or hypothetical chemical structures. Most of
these compounds have very little of the experimental data necessary for the
estimation of their toxicity or efficacy. In this age of combinatorial
chemistry one can synthesize thousands of chemicals very quickly. However,
experimental testing of these large numbers of chemicals would not be cost
effective. Also, it is possible to create virtual libraries consisting of
billions of structures. In this case one would like to know the toxic, as
well as therapeutic, potential of such a vast collection of chemicals. The
experimental data necessary for the prediction of the toxicity/activity of
these large and diverse sets of chemicals will not be available to us in the
near future.

This pervasive lack of experimental data demonstrates the need for the
development of predictive models based on parameters that can be calculated
directly from a chemical's molecular structure. Recently, our research
group has been involved in the development of a hierarchical approach
to quantitative structure-activity relationship (QSAR) model development
for predicting physicochemical, toxicological and pharmacological
properties of chemicals using theoretical molecular descriptors [3, 6-10].
Various topological indices (TIs) fall in this category of molecular
descriptors [11-23]. Balaban has classified TIs into three generations based
on whether they are integers, real numbers or a sequence of numbers [24].
Different classes of TIs quantify various aspects of molecular structure. We
have shown in the past that various indices, viz., connectivity indices and
complexity indices, encode distinctly different types of molecular structural
information. Such indices can be calculated very rapidly. On the other hand,
geometrical and quantum chemical parameters encode information regarding the
stereo-electronic aspects of molecules. These classes of parameters are also
algorithmically derived, i.e., they can be calculated for any real or
hypothetical molecular structure without any input of experimental data.

One of our recent interests has been to test the relative effectiveness of the
four classes of theoretical molecular descriptors mentioned above in the
development of QSARs for predicting the property/activity/toxicity of
chemicals. In this paper we have used these parameters in the development of
models for predicting the mutagenicity/non-mutagenicity of a set of 127
aromatic amines.


METHODS 


Datasets 


A set of 127 aromatic and heteroaromatic amines, previously collected from
the literature by Debnath et al. [25], was used to study mutagenicity. The
mutagenicity of these compounds in S. typhimurium TA98 + S9 microsomal
preparation has been expressed as positive or negative mutagenicity by
Benigni [26]. Compounds included in this study and their mutagenic
classification based on experimentally determined mutagenic potency are
given in Table I. Of the compounds used in this study, 106 were classified as
mutagens while 21 were determined to be non-mutagens.


TABLE I  Aromatic and heteroaromatic amines¹

Chemicals                                       TA98 (Expl.)  TA98 (Pred.)²
2-Bromo-7-aminofluorene                         1             1
2-Methoxy-5-methylaniline (p-cresidine)         1             1
5-Aminoquinoline                                1             1
4-Ethoxyaniline (p-phenetidine)                 1             1
1-Aminonaphthalene                              1             1
4-Aminofluorene                                 1             1
2-Aminoanthracene                               1             1
7-Aminofluoranthene                             1             1
8-Aminoquinoline                                1             1
1,7-Diaminophenazine                            1             1
2-Aminonaphthalene                              1             1
4-Aminopyrene                                   1             1
3-Amino-3'-nitrobiphenyl                        1             1
2,4,5-Trimethylaniline                          1             1
3-Aminofluorene                                 1             1
3,3'-Dichlorobenzidine                          1             1
2,4-Dimethylaniline (2,4-xylidine)              1             1
2,7-Diaminofluorene                             1             1
3-Aminofluoranthene                             1             1
2-Aminofluorene                                 1             1
2-Amino-4'-nitrobiphenyl                        1             1
4-Aminobiphenyl                                 1             1
3-Methoxy-4-methylaniline (o-cresidine)         1             0
2-Aminocarbazole                                1             1
2-Amino-5-nitrophenol                           1             1
2,2'-Diaminobiphenyl                            1             1
2-Hydroxy-7-aminofluorene                       1             1
1-Aminophenanthrene                             1             1
2,5-Dimethylaniline (2,5-xylidine)              1             1
4-Amino-2'-nitrobiphenyl                        1             1
2-Amino-4-methylphenol                          1             1
2-Aminophenazine                                1             1
4-Aminophenylsulfide                            1             1
2,4-Dinitroaniline                              1             1
2,4-Diaminoisopropylbenzene                     1             1
2,4-Difluoroaniline                             1             1
4,4'-Methylenedianiline                         1             1
3,3'-Dimethylbenzidine                          1             1
2-Aminofluoranthene                             1             1
2-Amino-3'-nitrobiphenyl                        1             1
1-Aminofluoranthene                             1             1
TABLE I  (Continued)

Chemicals                                       TA98 (Expl.)  TA98 (Pred.)²
4,4'-Ethylenebis(aniline)                       1             1
4-Chloroaniline                                 1             1
2-Aminophenanthrene                             1             1
4-Fluoroaniline                                 1             1
9-Aminophenanthrene                             1             1
3,3'-Diaminobiphenyl                            1             1
2-Aminopyrene                                   1             1
2,6-Dichloro-1,4-phenylenediamine               1             1
2-Amino-7-acetamidofluorene                     1             1
2,8-Diaminophenazine                            1             1
6-Aminoquinoline                                1             1
4-Methoxy-2-methylaniline (m-cresidine)         1             1
3-Amino-2'-nitrobiphenyl                        1             1
2,4'-Diaminobiphenyl                            1             1
1,6-Diaminophenazine                            1             1
4-Aminophenyldisulfide                          1             1
2-Bromo-4,6-dinitroaniline                      1             1
2,4-Diamino-n-butylbenzene                      1             0
4-Aminophenylether                              1             1
2-Aminobiphenyl                                 1             1
1,9-Diaminophenazine                            1             1
1-Aminofluorene                                 1             1
8-Aminofluoranthene                             1             1
2-Chloroaniline                                 1             0
2-Amino-α,α,α-trifluorotoluene                  1             1
2-Amino-1-nitronaphthalene                      1             1
3-Amino-4'-nitrobiphenyl                        1             1
4-Bromoaniline                                  1             1
2-Amino-4-chlorophenol                          1             1
3,3'-Dimethoxybenzidine                         1             1
4-Cyclohexylaniline                             1             1
4-Phenoxyaniline                                1             1
4,4'-Methylenebis(o-ethylaniline)               1             0
2-Amino-7-nitrofluorene                         1             1
Benzidine                                       1             1
1-Amino-4-nitronaphthalene                      1             1
4-Amino-3'-nitrobiphenyl                        1             1
4-Amino-4'-nitrobiphenyl                        1             1
1-Aminophenazine                                1             1
4,4'-Methylenebis(o-fluoroaniline)              1             1
4-Chloro-2-nitroaniline                         1             1
3-Aminoquinoline                                1             1
3-Aminocarbazole                                1             1
4-Chloro-1,2-phenylenediamine                   1             1
3-Aminophenanthrene                             1             1
3,4'-Diaminobiphenyl                            1             1
1-Aminoanthracene                               1             1
1-Aminocarbazole                                1             1
9-Aminoanthracene                               1             1
4-Aminocarbazole                                1             1
6-Aminochrysene                                 1             1
1-Aminopyrene                                   1             1
4,4'-Methylenebis(o-isopropylaniline)           1             0
TABLE I  (Continued)

Chemicals                                       TA98 (Expl.)  TA98 (Pred.)²
2,7-Diaminophenazine                            1             1
4-Aminophenanthrene                             0             1
2,4-Diaminotoluene                              1             1
3,3'-Diaminobenzidine                           1             1
1,3-Phenylenediamine                            1             0
3,4-Diaminotoluene                              1             1
1,2-Phenylenediamine                            1             0
3-Amino-6-methylphenol                          1             1
2,4-Diaminoethylbenzene                         1             1
4,4'-Methylenebis(2,6-diisopropylaniline)       0             0
4,4'-Methylenebis(2,6-diethylaniline)           0             0
4,4'-Methylenebis(2-methyl-6-t-butylaniline)    0             0
4,4'-Methylenebis(2-methyl-6-isopropylaniline)  0             0
4,4'-Methylenebis(2-methyl-6-ethylaniline)      0             0
4,4'-Methylenebis(2,6-dimethylaniline)          0             1
3-Aminobiphenyl                                 0             1
2,3-Diaminobiphenyl                             0             1
2-Methyl-4-chloroaniline                        0             1
2-Chloro-4-methylaniline                        0             1
4-Methoxyaniline                                0             1
3-Methoxyaniline                                0             1
Aniline                                         0             0
3-Chloroaniline                                 0             0
3-Ethoxyaniline                                 0             1
2-Ethoxyaniline                                 0             1
4-Aminophenol                                   0             1
3-Aminophenol                                   0             0
2-Aminophenol                                   0             0
2-Methoxyaniline                                0             1
4-Chloro-1,3-phenylenediamine                   1             1
2-Nitro-1,4-phenylenediamine                    1             1
4-Nitro-1,3-phenylenediamine                    1             1
4-Nitro-1,2-phenylenediamine                    1             1

¹ The table reports the mutagenicity of the aromatic and heteroaromatic amines as: 0 = negative;
1 = positive.

² TA98 results predicted using topostructural and topochemical indices.


Computation  of  Indices 

The topological indices used in this study were calculated by POLLY 2.3
[27], which can calculate a total of 102 indices. These indices include the
Wiener index [28], connectivity indices [11, 12], information theoretic
indices defined on distance matrices of graphs [13, 14], a set of parameters
derived from the neighborhood complexity of vertices in hydrogen-filled
molecular graphs [15-18], as well as Balaban's J indices [19-21]. Table II
provides brief definitions for the topological indices included in this study.
TABLE II  Symbols, definitions and classifications of topological parameters

Topostructural

I_D^W       Information index for the magnitudes of distances between all possible pairs of
            vertices of a graph
Ī_D^W       Mean information index for the magnitude of distance
W           Wiener index = half-sum of the off-diagonal elements of the distance matrix of
            a graph
I^D         Degree complexity
H^V         Graph vertex complexity
H^D         Graph distance complexity
IC          Information content of the distance matrix partitioned by frequency of
            occurrences of distance h
O           Order of neighborhood when IC_r reaches its maximum value for the
            hydrogen-filled graph
M_1         A Zagreb group parameter = sum of square of degree over all vertices
M_2         A Zagreb group parameter = sum of cross-product of degrees over all
            neighboring (connected) vertices
^hχ         Path connectivity index of order h = 0-6
^hχ_C       Cluster connectivity index of order h = 3-6
^hχ_Ch      Chain connectivity index of order h = 3-6
^hχ_PC      Path-cluster connectivity index of order h = 4-6
P_h         Number of paths of length h = 0-10
J           Balaban's J index based on distance

Topochemical

I_ORB       Information content or complexity of the hydrogen-suppressed graph at its
            maximum neighborhood of vertices
IC_r        Mean information content or complexity of a graph based on the rth (r = 0-6)
            order neighborhood of vertices in a hydrogen-filled graph
SIC_r       Structural information content for rth (r = 0-6) order neighborhood of vertices
            in a hydrogen-filled graph
CIC_r       Complementary information content for rth (r = 0-6) order neighborhood of
            vertices in a hydrogen-filled graph
^hχ^b       Bond path connectivity index of order h = 0-6
^hχ^b_C     Bond cluster connectivity index of order h = 3-6
^hχ^b_Ch    Bond chain connectivity index of order h = 3-6
^hχ^b_PC    Bond path-cluster connectivity index of order h = 4-6
^hχ^v       Valence path connectivity index of order h = 0-6
^hχ^v_C     Valence cluster connectivity index of order h = 3-6
^hχ^v_Ch    Valence chain connectivity index of order h = 3-6
^hχ^v_PC    Valence path-cluster connectivity index of order h = 4-6
J^B         Balaban's J index based on bond types
J^X         Balaban's J index based on relative electronegativities
J^Y         Balaban's J index based on relative covalent radii
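
To make three of the simpler definitions in Table II concrete, the following minimal Python sketch computes the Wiener index W and the Zagreb parameters M_1 and M_2 for a hydrogen-suppressed graph. The adjacency-list representation, the function names and the n-butane example are illustrative assumptions; this is not the POLLY implementation.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological (bond-count) distances between all vertex pairs, via BFS."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def wiener_index(adjacency):
    """Half-sum of the off-diagonal elements of the distance matrix."""
    return sum(sum(row) for row in distance_matrix(adjacency)) // 2


def zagreb_m1(adjacency):
    """Sum of squared vertex degrees."""
    return sum(len(neighbors) ** 2 for neighbors in adjacency)


def zagreb_m2(adjacency):
    """Sum of degree products over all connected vertex pairs (edges)."""
    return sum(
        len(adjacency[u]) * len(adjacency[v])
        for u in range(len(adjacency))
        for v in adjacency[u]
        if u < v  # count each edge once
    )


# Hypothetical example: n-butane as the path graph 0-1-2-3.
butane = [[1], [0, 2], [1, 3], [2]]
w = wiener_index(butane)
m1 = zagreb_m1(butane)
m2 = zagreb_m2(butane)
```

For this four-vertex path the distances are 1, 2, 3, 1, 2 and 1, so W = 10; the degrees are 1, 2, 2, 1, giving M_1 = 10 and M_2 = 8.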


Values for log P and the quantum chemical parameters E_HOMO and
E_LUMO were taken from the work of Debnath et al. [25]. Octanol/water
partition coefficients (log P) were determined experimentally for a set of 67
aromatic and heteroaromatic amines and, when these values were found
to be in agreement with values calculated using the CLOGP program (release
3.54), the remainder of the log P values were calculated using CLOGP [29].
The quantum chemical parameters provided by Debnath et al., E_HOMO and
E_LUMO, were calculated using the semi-empirical AM1 method of MOPAC 4.10
(Quantum Chemistry Program Exchange No. 455) [30].


Data  Reduction 

Initially,  all  TIs  were  transformed  by  the  natural  logarithm  of  the  index  plus 
one.  This  was  done  since  the  scale  of  some  indices  may  be  several  orders  of 
magnitude  greater  than  that  of  other  indices  and  other  indices  may  equal  zero. 
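
The transformation described above can be sketched as follows. The function name and the raw values are hypothetical and serve only to show why adding one before taking the logarithm keeps zero-valued indices defined while compressing large ones.

```python
import math


def log_transform(values):
    """Return ln(x + 1) for each topological index value."""
    return [math.log1p(x) for x in values]


# Hypothetical raw index values spanning several orders of magnitude,
# including a zero (which plain ln(x) could not handle).
raw = [0.0, 1.5, 120.0, 35000.0]
scaled = log_transform(raw)
```

After the transform a zero stays at zero and the largest value in this example is compressed to roughly 10.5, so all indices end up on a comparable scale.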

The  set  of  95  TIs  was  partitioned  into  two  distinct  sets:  38  topostructural 
indices  and  57  topochemical  indices.  Topostructural  indices  are  indices 
which  encode  information  about  the  adjacency  and  distances  of  atoms 
(vertices)  in  molecular  structures  (graphs)  irrespective  of  the  chemical  nature 
of  the  atoms  involved  in  the  bonding  or  factors  like  hybridization  states  of 
atoms  and  number  of  core/valence  electrons  in  individual  atoms.  Topo¬ 
chemical  indices  are  parameters  which  quantify  information  regarding  the 
topology  (connectivity  of  atoms)  as  well  as  specific  chemical  properties  of 
the  atoms  comprising  a  molecule.  Topochemical  indices  are  derived  from 
weighted  molecular  graphs  where  each  vertex  (atom)  is  properly  weighted 
with  selected  chemical/physical  properties.  The  categorization  of  the  95  TIs 
into  these  sets  is  shown  in  Table  II. 

To  further  reduce  the  number  of  independent  variables  to  be  used  for  model 
construction,  the  sets  of  topostructural  and  topochemical  indices  were  further 
divided  into  subsets,  or  clusters,  based  on  the  correlation  matrix  using  the  SAS 
procedure  VARCLUS  [31].  This  variable  clustering  procedure  divides  the 
set  of  indices  into  disjoint  clusters  such  that  each  cluster  is  essentially 
unidimensional.  The  index  most  correlated  with  each  cluster,  as  well  as  any 
indices  which  were  poorly  correlated  with  the  cluster  (r  <  0.70),  were  selected 
for  model  development.  Variable  clustering  was  performed  independently  for 
both  the  topostructural  and  topochemical  subsets. 
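
The selection rule applied after clustering can be illustrated with a short Python sketch. It assumes the clustering itself has already been done (for example by VARCLUS) and that each index comes with its correlation to its own cluster; the data structure, the index names and the use of the absolute value of r are assumptions made for illustration only.

```python
def select_indices(cluster_corr, threshold=0.70):
    """Pick, per cluster, the best-correlated index plus any poorly correlated ones.

    cluster_corr maps a cluster name to {index name: correlation with that cluster}.
    """
    selected = []
    for members in cluster_corr.values():
        # The index most correlated with its cluster represents the cluster.
        best = max(members, key=lambda name: abs(members[name]))
        selected.append(best)
        # Indices poorly correlated with the cluster carry extra information.
        for name, r in members.items():
            if name != best and abs(r) < threshold:
                selected.append(name)
    return selected


# Hypothetical clusters and correlations.
clusters = {
    "cluster_1": {"A": 0.98, "B": 0.95, "C": 0.62},
    "cluster_2": {"D": 0.91, "E": 0.88},
}
chosen = select_indices(clusters)
```

Here index A represents the first cluster, C is kept as well because its correlation falls below 0.70, and D represents the second cluster.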


Statistical  Analysis  and  Hierarchical  DFA 

Selection of indices for the final models was conducted using all subsets
regression, in the SAS procedure REG [32], on the sets of indices chosen
through variable cluster analysis. This all subsets procedure was performed on
four distinct sets of indices: (1) the topostructural indices selected by variable
clustering, (2) the topostructural indices selected in all subsets regression and
the topochemical indices selected during variable clustering, (3) the
topostructural and topochemical indices selected in all subsets regression
and log P, and (4) the model chosen for topostructural and topochemical
indices with log P and with the addition of E_HOMO and E_LUMO. These sets of
indices were then used to develop and cross-validate discriminant function
models for classifying the mutagenicity/non-mutagenicity of the 127 aromatic
and heteroaromatic amines. Figure 1 illustrates the process for the selection of
indices and formulation of DFA models.
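
As a rough illustration of what developing and cross-validating a discriminant function involves, the sketch below fits a two-class linear discriminant on two descriptors and estimates its accuracy by leave-one-out cross-validation. It is a minimal sketch under stated assumptions: the data are invented, only two features are handled so that the covariance matrix can be inverted by hand, and neither the cross-validation scheme nor the code reflects the SAS procedure actually used in the study.

```python
def mean(rows):
    """Column-wise mean of a list of feature rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_cov(groups, means):
    """Pooled within-class covariance matrix (2x2) for two-feature data."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for rows, m in zip(groups, means):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        dof += len(rows) - 1
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]


def fit_lda(x, y):
    """Return discriminant weights and the cut-off midway between class means."""
    g0 = [r for r, c in zip(x, y) if c == 0]
    g1 = [r for r, c in zip(x, y) if c == 1]
    m0, m1 = mean(g0), mean(g1)
    s = pooled_cov([g0, g1], [m0, m1])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]


def predict(model, row):
    w, cut = model
    return 1 if w[0] * row[0] + w[1] * row[1] > cut else 0


def leave_one_out(x, y):
    """Fraction of compounds classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(x)):
        model = fit_lda(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, x[i]) == y[i]
    return hits / len(x)


# Hypothetical descriptor values for four non-mutagens (0) and four mutagens (1).
x = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.8, 1.9],
     [4.0, 5.0], [4.5, 5.2], [3.8, 4.7], [4.2, 5.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = leave_one_out(x, y)
```

Because the invented classes are well separated, every held-out compound is classified correctly here; real descriptor data would normally give a lower cross-validated rate than the resubstitution rate.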


RESULTS  AND  DISCUSSION 

In the first step of our hierarchical modeling, 38 topostructural parameters
were subjected to the variable clustering procedure. The following indices were
retained from the five clusters generated: $I_D^W$, $IC$, $O$, $^4\chi_C$,
$^6\chi_{Ch}$, $^4\chi_{PC}$, $P_3$, $P_7$. These five clusters explained a
total variation of 35.29, and the proportion of the variance explained was
equal to 92.86%. Of the 57 topochemical indices, the following ten indices
were selected from eight clusters: $IC_0$, $IC_2$, $IC_4$, $SIC_2$, $SIC_4$,
$^4\chi_C$, $^6\chi_{Ch}$, $^4\chi_{PC}$, $^2\chi^v$, $J^Y$. The eight clusters
generated from the topochemical indices resulted in a total variation
explained of 51.65, and the
proportion of the variance explained was equal to 90.61%. These indices
were  then  included  in  the  all  subsets  regression  procedure  for  the  selection  of 
final indices for discriminant function analysis. In all cases, the RSQUARE
and ADJRSQ values were examined as indicators of model fit; however, the
final models were selected based on Mallows' Cp statistic (CP). Statistics
for the cluster analysis and the intercorrelation of the clusters for the
topostructural indices are presented in Tables III and IV, respectively. Similar
statistics  for  the  variable  clustering  of  the  topochemical  indices  can  be  found 
in  Tables  V  and  VI. 
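To make the role of Mallows' Cp in all-subsets selection concrete, the following is a minimal sketch, not the SAS REG procedure. It fits every subset of candidate variables by ordinary least squares and computes Cp = SSE_p / s^2 - n + 2p, where s^2 is the mean squared error of the full model and p counts the intercept. The helper names and the small data set are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def mallows_cp_table(data, y):
    """Return {subset of variable names: Cp} for every non-empty subset."""
    names = sorted(data)
    n = len(y)
    full_p = len(names) + 1                      # parameters incl. intercept
    s2 = sse([data[v] for v in names], y) / (n - full_p)
    table = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1
            table[subset] = sse([data[v] for v in subset], y) / s2 - n + 2 * p
    return table


# Hypothetical data: y depends on x1 only, plus small deterministic noise.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 0, 1, 0, 1, 0, 1, 0],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [2.1, 3.8, 6.05, 8.15, 9.9, 12.0, 13.95, 16.05]
table = mallows_cp_table(data, y)
best = min(table, key=table.get)
print(best, round(table[best], 2))
```

By construction the full model always has Cp equal to its own parameter count, so the statistic is informative only for comparing the smaller subsets against that baseline.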

The all-subsets regression of the eight topostructural indices resulted in
the selection of the following indices for model development: $I_D^W$, $IC$,
$P_3$. These indices were used to create the topostructural DFA model, the
simplest model in the hierarchy, and were also combined with the ten
topochemical indices to create the second model in the hierarchy. All-subsets
regression of the thirteen topostructural and topochemical indices resulted
in the selection of the following indices for modeling: $I_D^W$, $IC$, $P_3$,
$IC_0$, $SIC_2$. These indices were combined with log P, giving a
six-parameter model in which log P is added to the complete set of
descriptors from the second model. Finally, the quantum chemical descriptors,
$E_{HOMO}$ and $E_{LUMO}$, were combined with the set of six indices, and
all-subsets regression was used again to select the best parameters for model
construction. This procedure resulted in the selection of the following
model: $I_D^W$, $IC$, $P_3$, log P, $E_{LUMO}$.

FIGURE 1 Illustration of the hierarchical method of index selection and discriminant
function analysis.

Discriminant function analysis, using the SAS procedure DISCRIM [33],
was used to develop models for predicting mutagenicity/non-mutagenicity

TABLE III Statistics for the variable cluster analysis of the topostructural indices

Cluster  Members  Variation  Proportion  Second      Index most     Correlation
                  explained  explained   eigenvalue  correlated
1        18       16.99      0.94        0.71        $P_7$          0.9918
2        2        2.00       1.00        0.00        $^4\chi_C$     0.9992
3        3        2.15       0.71        0.72        $^6\chi_{Ch}$  0.9104
4        12       11.41      0.95        0.45        $I_D^W$        0.9977
5        3        2.73       0.91        0.18        $^4\chi_{PC}$  0.9474

TABLE  IV  Intercorrelation  of  the  clusters  generated  in  the  variable  cluster  analysis  of  the 
topostructural  indices 

Cluster 

1 

2 

3 

4 

5 

1 

I^R!8 

2 

wm::.  m 

3 

ilH 

4 

0.1389 

1.0000 

5 

0.7131 

0.4006 

mSM 

0.7793 

1.0000 

TABLE V Statistics for the variable cluster analysis of the topochemical indices

Cluster  Members  Variation  Proportion  Second      Index most     Correlation
                  explained  explained   eigenvalue  correlated
1        19       17.61      0.93        0.58        $^2\chi^v$     0.9686
2        8        7.52       0.94        0.42        $SIC_4$        0.9876
3        4        3.76       0.94        0.24        $^4\chi_C$     0.9484
4        6        5.11       0.85        0.80        $J^Y$          0.8889
5        5        4.72       0.94        0.23        $IC_4$         0.9880
6        4        3.72       0.93        0.27        $^6\chi_{Ch}$  0.9419
7        6        4.68       0.78        0.79        $SIC_2$        0.9079
8        5        4.52       0.90        0.21        $^4\chi_{PC}$  0.9225

TABLE  VI  Intercorrelation  of  the  clusters  generated  in  the  variable  cluster  analysis  of  the 
topochemical  indices 


Cluster 

I 

2 

3 

4 

5 

6 

7 

8 

1 

2 

-0.4121 

3 

0.2311 

■ESI 

4 

-0.8162 

e£m9 

5 

0.6649 

-0.0641 

-0.2594 

6 

0.4739 

0.2192 

-0.0509 

-0.4812 

0.5033 

7 

-0.5604 

0.4636 

-0.1072 

0.7565 

-0.0130 

1.0000 

8 

-0.5046 

0.5542 

-0.4287 

0.0484 

-0.2913 

1.0000

TABLE VII Results of the cross-validated discriminant function analyses

Hierarchical classes                    Indices                                           % Correct       % Correct
                                                                                          (non-mutagens)  (mutagens)
Topostructural                          $I_D^W$, $IC$, $P_3$                              28.6            95.3
Topostructural + Topochemical           $I_D^W$, $IC$, $P_3$, $IC_0$, $SIC_2$             42.9            93.4
Topological + log P                     $I_D^W$, $IC$, $P_3$, $IC_0$, $SIC_2$, log P      38.1            95.3
Topological + log P + Quantum chemical  $I_D^W$, $IC$, $P_3$, log P, $E_{LUMO}$           33.3            95.3

of chemicals in the Ames test. Four distinct models were developed using the
indices selected from the all-subsets regression procedure as described above.
The results in Table VII show that all four models correctly classified
mutagens 93% to 95% of the time, whereas they were far less effective
in predicting non-mutagenicity (29% to 43%).
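The two "% Correct" columns are per-class rates, which is why a model can look strong on mutagens and weak on non-mutagens at the same time. A minimal sketch of that calculation follows; the labels and predictions are invented for illustration and are not the study's data.

```python
def percent_correct_by_class(actual, predicted):
    """Return {class_label: percent of that class predicted correctly}."""
    totals, hits = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        hits[a] = hits.get(a, 0) + (1 if a == p else 0)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


# Hypothetical cross-validated predictions for an imbalanced toy set.
actual = ["mutagen"] * 8 + ["non-mutagen"] * 4
predicted = (["mutagen"] * 7 + ["non-mutagen"]      # 7 of 8 mutagens right
             + ["mutagen"] * 3 + ["non-mutagen"])   # 1 of 4 non-mutagens right
rates = percent_correct_by_class(actual, predicted)
overall = 100.0 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(rates, round(overall, 1))
```

Reporting the rates separately, as Table VII does, keeps the majority class from masking poor performance on the minority class.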

Table VII also shows the effect of adding topochemical indices to the set of
topostructural indices, which resulted in the best predictive model. The
addition slightly decreased the correct prediction of mutagens (from 95.3%
to 93.4%). However, it markedly improved the prediction of non-mutagens
(from 28.6% to 42.9%).

Finally, the addition of log P and quantum chemical indices did not improve
the models. This is in line with our earlier work with physical and
biochemical properties, which showed that topostructural and topochemical
indices explain most of the variance in the data [3, 6-10].

Abstract

This paper presents a novel and effective genetic algorithm approach for
generating computational models for hazard assessment. With millions of
proposed chemicals being registered each year, it is impossible to come
even remotely close to completing the battery of tests needed for the
proper understanding of the toxic effects of these chemicals. Computer
models can give quick, cheap, and environmentally friendly hazard
assessments of chemicals. Our approach works by first extracting a
hierarchy of theoretical descriptors of the structure of a compound, then
filtering these numerous descriptors with a genetic algorithm approach to
ensemble feature selection. We tested the utility of our approach by
modeling the acute aquatic toxicity (LC50) of a congeneric set of 69
benzene derivatives. Our results demonstrate a very important point: our
method is able to accurately predict toxicity directly from structure.

1  INTRODUCTION 

By the end of 1998 the number of chemicals registered with the Chemical
Abstracts Service rose to over 19 million (CAS 1999). This is an increase
of over 3 million chemicals between 1996 and 1998. It is desirable to test
each of these chemicals for its effects on the environment and human health
(which we refer to as hazard assessment); however, completing the battery
of tests necessary for the proper hazard assessment of even a single
compound is a costly and time-consuming process. Therefore, there is simply
not enough time or money to complete these test batteries for even a tiny
portion of the compounds that are registered today (Menzel 1995). An
alternative to these traditional test batteries is to develop computational
models for hazard assessment. Computational models are fast (milliseconds
per compound), cheap (less than one cent per compound), and do not run the
risk of adversely affecting the environment during testing. Additionally,
these computational methods can replace or limit the amount of animal
testing that is necessary. Thus computational models can easily process all
registered chemicals and flag the ones that require further testing. The
central problem with this approach is developing class-specific models that
are accurate enough to be useful. In this paper, we present a novel and
effective approach for learning computational hazard assessment models by
using an ensemble feature selection algorithm based on genetic algorithms
(GAs) to filter numerous theoretical descriptors of chemical structure.

To better illustrate the need for effective and quick hazard assessment,
consider the situation of the industrial chemicals "grandfathered" into
continued use under the Toxic Substances Control Act (TSCA) of 1976. TSCA
has required that a suite of physicochemical and toxicological screens be
run on all commercial compounds (those produced or imported in volumes
exceeding one million pounds annually) developed after 1976. However, there
are almost 3,000 chemicals that were "grandfathered" in with the
understanding that it would be the responsibility of the chemical
manufacturing industry to ultimately supply information about these
chemicals. Only recently, after a 20-year delay, are the chemical
manufacturers talking about running 2,800 of these compounds through basic
toxicity screens. While this is promising, these screens will not be
completed until 2004, at a cost of between $500 and $700 million. So it
will be another five years before we have basic toxicity data on compounds
that have been in widespread use for more than twenty years (Johnson 1998).

One of the fundamental principles of biochemistry is that activity is
dictated by structure (Hansch 1976). Following this principle, one can use
theoretical molecular descriptors that quantify structural aspects of a
molecule to quantitatively determine its activity (Basak & Grunwald 1995;
Cramer, Famini, & Lowrey 1993). These theoretical descriptors can be
generated directly from the known structure of the molecule and used to
estimate its properties, without the need for further experimental data.
This is important because, for the chemicals needing to be evaluated for
hazard assessment, there is a scarcity of the experimental data normally
required as inputs (i.e., independent variables) to traditional
quantitative structure-activity relationship (QSAR) model development. A
QSAR model based solely on theoretical descriptors, on the other hand, can
process all registered chemicals for hazard assessment.

Our hierarchical approach examines the relative contributions of
theoretical descriptors of gradually increasing complexity (structural,
chemical, shape, and quantum chemical descriptors). This approach is
important because none of the individual classes of parameters is very
effective at predicting toxicity (Gute & Basak 1997); however, we show in
this paper that we can effectively predict toxicity if we combine all
levels of descriptors. One potential problem with our hierarchical approach
is that it often yields many independent variables relative to the number
of data points, since a limited number of data points is not uncommon in
hazard assessment. For instance, in our case study of predicting acute
toxicity (LC50) of benzene derivatives, we have 95 independent variables
and 69 data points. Therefore, reducing the number of independent variables
is critical when attempting to model small data sets. The smaller the data
set, the greater the chance of spurious correlations when using a large
number of independent variables (descriptors). In some of our earlier QSAR
studies we have used statistical methods such as principal components
analysis (PCA) and variable clustering methods to reduce the number of
independent variables (Basak & Grunwald 1995; Gute & Basak 1997; Gute,
Grunwald, & Basak In press).

As an alternative solution, we use our previous ensemble feature selection
approach (Opitz 1999), which is based on GAs. An "ensemble" is a
combination of the outputs from a set of models that are generated from
separately trained inductive learning algorithms. Ensembles have been
shown to, in most cases, greatly improve generalization accuracy over a
single learning model (Breiman 1996; Maclin & Opitz 1997; Schapire et al.
1997). Recent research has shown that an effective ensemble should consist
of a set of models that are not only highly correct, but that also make
their errors on different parts of the input space (Hansen & Salamon 1990;
Krogh & Vedelsby 1995; Opitz & Shavlik 1996a). Varying the feature subsets
used by each member of the ensemble helps promote the necessary diversity
and create a more effective ensemble (Opitz 1999). We use GAs to search the
enormous space of candidate sets of feature subsets for one that promotes
disagreement among the component members of an ensemble while still
maintaining the component members' accuracy.

Combining our approach of generating hierarchical theoretical descriptors
with our other approach to GA-based ensemble feature selection, we are
able to generate an effective model for predicting the toxicity of benzene
derivatives using only a few compounds. Our results show that our model is
nearly as accurate as the battery of tests necessary for the proper hazard
assessment of a single compound. Our results also confirm that our new
ensemble feature selection approach is more effective than previous
approaches for modeling hazard assessment.

The rest of the paper is organized as follows. First we provide background
and related work for both our hierarchical QSAR approach and our GA-based
ensemble feature selection approach. This is followed by the results of
our approach applied to benzene derivatives. Finally, we discuss these
results and outline future work.

2  QSAR  AND  THEORETICAL 
METHODS 

QSARs have come into widespread use for the prediction of various
molecular properties, as well as biological, pharmacological and
toxicological responses. Traditional QSAR techniques use empirical
properties (Dearden 1990; Hansch & Leo 1995; de Waterbeemd 1995); however,
due to the scarcity of available data for the majority of chemicals
needing to be evaluated for hazard assessment, the physicochemical
properties necessary for traditional QSAR model development may not be
available. When this is the case, it is imperative that there are methods
available which make use of nonempirical parameters, which we term
theoretical molecular descriptors.

Topological indices (TIs) are numerical graph invariants that quantify
certain aspects of molecular structure (Gute & Basak 1997; Gute, Grunwald,
& Basak In press). The different classes of TIs provide us with
nonempirical, quantitative descriptors that can be used in place of
experimentally derived descriptors in QSARs for the prediction of
properties.

Our recent studies have focused on the role of different classes of
theoretical descriptors of increasing levels of complexity and their
utility in QSAR (Gute & Basak 1997; Gute, Grunwald, & Basak In press).
Four distinct sets of theoretical descriptors have been used in this
study: topostructural, topochemical, geometric, and quantum chemical
indices. Gute and Basak 1997 provide the detailed list of the indices
included in our study.

2.1  TOPOLOGICAL  INDICES

The topostructural and topochemical indices fall into the category
normally considered topological indices. Topostructural indices (TSIs) are
topological indices that only encode information about the adjacency and
distances of atoms (vertices) in molecular structures (graphs),
irrespective of the chemical nature of the atoms involved in bonding or
factors such as hybridization states and the number of core/valence
electrons in individual atoms. Topochemical indices (TCIs) are parameters
that quantify information regarding the topology (connectivity of atoms),
as well as specific chemical properties of the atoms comprising a
molecule. These indices are derived from weighted molecular graphs where
each vertex (atom) or edge (bond) is properly weighted with selected
chemical or physical property information.

The complete set of topological indices used in this study, both the
topostructural and the topochemical, was calculated using POLLY 2.3
(Basak, Harriss, & Magnuson 1988) and software developed by the authors.
These indices include the Wiener index (Wiener 1947), the connectivity
indices developed by Randic 1975 and higher order connectivity indices
formulated by Kier and Hall 1986, bonding connectivity indices defined by
Basak and Magnuson 1988, a set of information theoretic indices defined on
the distance matrices of simple molecular graphs (Hansch & Leo 1995),
neighborhood complexity indices of hydrogen-filled molecular graphs, and
Balaban's 1983 J indices.
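As a small illustration of what a topostructural index encodes, the sketch below computes two of the indices named above, the Wiener index and the first-order Randic connectivity index, for hydrogen-suppressed graphs given as adjacency lists. It is not the POLLY implementation; the function names and the two example graphs (the carbon skeletons of n-butane and isobutane) are chosen here for illustration.

```python
from collections import deque
from math import sqrt


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    total = 0
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
    return total // 2                     # each pair was counted twice


def randic_index(adjacency):
    """First-order connectivity: sum over edges of 1/sqrt(deg_a * deg_b)."""
    total = 0.0
    for a in adjacency:
        for b in adjacency[a]:
            if a < b:                     # visit each edge once
                total += 1.0 / sqrt(len(adjacency[a]) * len(adjacency[b]))
    return total


# Hydrogen-suppressed carbon skeletons (vertex labels are arbitrary).
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(n_butane), wiener_index(isobutane))
print(round(randic_index(n_butane), 4), round(randic_index(isobutane), 4))
```

Both indices depend only on adjacency and distance, so they distinguish the two isomers by branching alone, without any information about atom types.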

2.2  GEOMETRICAL  INDICES 

The geometrical indices are the three-dimensional (3-D) Wiener numbers for
the hydrogen-filled and hydrogen-suppressed molecular structures, and the
van der Waals volume. Van der Waals volume, $V_W$ (Bondi 1964), was
calculated using Sybyl 6.1 from Tripos Associates, Inc. of St. Louis. The
3-D Wiener numbers were calculated by Sybyl using an SPL (Sybyl
Programming Language) program developed in our lab (SYBYL 1998).
Calculation of 3-D Wiener numbers consists of summing the entries in the
upper triangular submatrix of the topographic Euclidean distance matrix
for a molecule. The 3-D coordinates for the atoms were determined using
CONCORD 3.0.1 from Tripos Associates, Inc. Two variants of the 3-D Wiener
number were calculated: $^{3D}W_H$ and $^{3D}W$. For $^{3D}W_H$, hydrogen
atoms are included in the computations, and for $^{3D}W$ hydrogen atoms
are excluded from the computations.
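The 3-D Wiener number described above reduces to a sum of pairwise Euclidean distances. The following is a minimal sketch of that sum, not the SPL program used in the study; the function name, the `include_hydrogens` flag, and the three-atom coordinates are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def wiener_3d(atoms, include_hydrogens=True):
    """Sum the upper triangle of the Euclidean distance matrix.

    atoms is a list of (element_symbol, (x, y, z)) tuples. With
    include_hydrogens=False, hydrogen atoms are dropped first.
    """
    coords = [xyz for element, xyz in atoms
              if include_hydrogens or element != "H"]
    return sum(dist(a, b) for a, b in combinations(coords, 2))


# Hypothetical coordinates, not a real optimized geometry.
toy = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("H", (0.0, 1.0, 0.0)),
]
w_with_h = wiener_3d(toy, include_hydrogens=True)
w_without_h = wiener_3d(toy, include_hydrogens=False)
print(round(w_with_h, 4), w_without_h)
```

Iterating over unordered pairs is equivalent to summing the upper triangular submatrix, because the distance matrix is symmetric with a zero diagonal.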

2.3  QUANTUM  CHEMICAL
PARAMETERS

The following quantum chemical parameters were calculated using the
Austin Model version one (AM1) semi-empirical Hamiltonian: energy of the
highest occupied molecular orbital ($E_{HOMO}$), energy of the second
highest occupied molecular orbital ($E_{HOMO1}$), energy of the lowest
unoccupied molecular orbital ($E_{LUMO}$), energy of the second lowest
unoccupied molecular orbital ($E_{LUMO1}$), heat of formation
($\Delta H_f$), and dipole moment ($\mu$). These parameters were
calculated using MOPAC 6.00 in the SYBYL interface (Stewart 1990).

3  FILTERING  DESCRIPTORS 

As stated above, one potential problem with including all theoretical
descriptors in the hierarchy is that it gives many independent variables
when compared to the limited number of data points available for hazard
assessment modeling of a particular class of chemical derivatives.
Compounding this problem is that a salient descriptor for one hazard
assessment model may not be a salient descriptor for another problem. That
is, the relevance of a descriptor for predicting hazard assessment is
often problem dependent. This section describes our approach for
automatically filtering the descriptors with a GA-based approach to
ensemble feature selection. Before explaining our algorithm, we briefly
cover the notion of ensembles.

3.1  ENSEMBLES 

Figure 1 illustrates the basic framework of a predictor ensemble. Each
predictor in the ensemble (predictor 1 through predictor N in this case)
is first trained using the training instances. Then, for each example, the
predicted output of each of these predictors ($o_i$ in Figure 1) is
combined to produce the output of the ensemble ($\hat{o}$ in Figure 1).
Many researchers (Breiman 1996; Hansen & Salamon 1990; Krogh & Vedelsby
1995; Opitz & Shavlik 1997) have demonstrated the effectiveness of
combining schemes that are simply the weighted average of the predictors
(i.e., $\hat{o} = \sum_{i \in N} w_i \cdot o_i$ and
$\sum_{i \in N} w_i = 1$), and this is the type of ensemble on which we
focus in this article.

Figure 1: A predictor ensemble.

Combining the output of several predictors is useful only if there is
disagreement on some inputs. Obviously, combining several identical
predictors produces no gain. Hansen and Salamon 1990 proved that for an
ensemble, if the average error rate for an example is less than 50% and
the predictors in the ensemble are independent in the production of their
errors, the expected error for that example can be reduced to zero as the
number of predictors combined goes to infinity; however, such assumptions
rarely hold in practice.

Krogh and Vedelsby 1995 later proved that the ensemble error can be
divided into a term measuring the average generalization error of each
individual predictor and a term called diversity that measures the
disagreement among the predictors. Formally, they define the diversity
term, $d_i$, of predictor $i$ on input $x$ to be:

$d_i(x) = [o_i(x) - \hat{o}(x)]^2$.  (1)

The quadratic error of predictor $i$ and of the ensemble are, respectively:

$\epsilon_i(x) = [o_i(x) - f(x)]^2$,  (2)

$e(x) = [\hat{o}(x) - f(x)]^2$,  (3)

where $f(x)$ is the target value for input $x$. If we define $E$, $E_i$,
and $D_i$ to be the averages, over the input distribution, of $e(x)$,
$\epsilon_i(x)$, and $d_i(x)$ respectively, then the ensemble's
generalization error can be shown to consist of two distinct portions:

$E = \bar{E} - \bar{D}$,  (4)

where $\bar{E}$ ($= \sum_i w_i E_i$) is the weighted average of the
individual predictors' generalization errors and $\bar{D}$
($= \sum_i w_i D_i$) is the weighted average of the diversity among these
predictors. What the equation shows, then, is that an ideal ensemble
consists of highly correct predictors that disagree as much as possible.
Opitz and Shavlik 1996a; 1996b empirically verified that such ensembles
generalize well.
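The decomposition in equation 4 can be checked numerically. The sketch below is an illustration only: it forms a weighted-average ensemble from three sets of invented predictions, averages the quantities of equations 1 to 3 over a handful of inputs, and confirms that the ensemble error equals the weighted average error minus the weighted average diversity. All names and numbers are hypothetical.

```python
def ensemble_decomposition(weights, outputs, targets):
    """Return (E, E_bar, D_bar) for a weighted-average ensemble.

    outputs[i][k] is predictor i's output on input k; the weights are
    assumed to sum to 1, as in the text.
    """
    n_inputs = len(targets)
    ens = [sum(w * o[k] for w, o in zip(weights, outputs))
           for k in range(n_inputs)]
    # Ensemble error, equation (3) averaged over inputs.
    E = sum((ens[k] - targets[k]) ** 2 for k in range(n_inputs)) / n_inputs
    # Individual errors, equation (2) averaged over inputs.
    E_i = [sum((o[k] - targets[k]) ** 2 for k in range(n_inputs)) / n_inputs
           for o in outputs]
    # Individual diversities, equation (1) averaged over inputs.
    D_i = [sum((o[k] - ens[k]) ** 2 for k in range(n_inputs)) / n_inputs
           for o in outputs]
    E_bar = sum(w * e for w, e in zip(weights, E_i))
    D_bar = sum(w * d for w, d in zip(weights, D_i))
    return E, E_bar, D_bar


weights = [0.5, 0.3, 0.2]
outputs = [[1.0, 2.0, 3.0], [1.5, 1.5, 2.5], [0.5, 2.5, 3.5]]
targets = [1.2, 2.1, 2.9]
E, E_bar, D_bar = ensemble_decomposition(weights, outputs, targets)
print(round(E, 6), round(E_bar - D_bar, 6))
```

Because the diversity term is never negative, the ensemble error can never exceed the weighted average error of its members, and the gap grows with disagreement.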

Regardless of theoretical justifications, methods for creating ensembles
center around producing predictors that disagree on their predictions.
Generally, these methods focus on altering the training process in the
hope that the resulting predictors will produce different predictions. For
example, neural network techniques that have been employed include methods
for training with different topologies, different initial weights,
different parameters, and training only on a portion of the training set
(Alpaydin 1993; Freund & Schapire 1996; Hansen & Salamon 1990; Maclin &
Shavlik 1995).

Numerous techniques try to generate disagreement among the classifiers by
altering the training set each classifier sees. The two most popular
techniques are Bagging (Breiman 1996) and Boosting (Freund & Schapire
1996). Bagging is a bootstrap ensemble method that trains each network in
the ensemble with a different bootstrap sample of the training set. It
generates each sample by randomly drawing, with replacement, N examples
from the training set, where N is the size of the training set. As with
Bagging, Boosting also chooses a training set of size N and initially sets
the probability of picking each example to be 1/N. After the first
network, however, these probabilities change to emphasize misclassified
instances. A large number of extensive empirical studies have shown that
these are highly successful methods that nearly always generalize better
than their individual component predictors (Bauer & Kohavi 1998; Maclin &
Opitz 1997; Quinlan 1996). Neither approach is appropriate for our domain
since we are data poor and cannot afford to waste training examples;
however, we are feature rich and can afford to create diversity by instead
varying the inputs to the learning algorithms. Varying the feature subsets
to create a diverse set of accurate predictors is the focus of the next
section.
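The reason resampling wastes examples can be seen in a short sketch of the bootstrap draw that Bagging performs. The code below is illustrative only; the function name and the fixed random seed are assumptions, and the data set size of 69 is taken from the case study in this paper.

```python
import random


def bootstrap_sample(n_examples, rng):
    """Draw n_examples indices with replacement, as Bagging does."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n = 69                          # size of the benzene-derivative data set
sample = bootstrap_sample(n, rng)
left_out = sorted(set(range(n)) - set(sample))
print(len(set(sample)), "distinct examples drawn;", len(left_out), "left out")
```

For large N, the expected fraction of examples missing from one bootstrap sample approaches 1/e, roughly 37%, which is a substantial loss when only 69 compounds are available.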

3.2  THE  GEFS  ALGORITHM 

The goal of our algorithm is to find a set of feature subsets that creates
an ensemble of classifiers (neural networks in this study) that maximize
equation 1 while minimizing equation 2. The space of candidate sets is
enormous and thus is particularly well suited for genetic algorithms.
Table 1 summarizes our recent algorithm (Opitz 1999) called GEFS (for
Genetic Ensemble Feature Selection) that uses GAs to generate a set of
classifiers that are accurate and diverse in their predictions. GEFS
starts by creating and training its initial population of networks. The
representation of each individual of our population is simply a
dynamic-length string of integers, where each integer indexes a particular
feature. We create networks from these strings by first having the input
nodes match the string of integers, then creating a standard
single-hidden-layer, fully connected neural network. Our algorithm then
creates new networks by using the genetic operators of crossover and
mutation.

Table 1: The GEFS algorithm.

GOAL: Find a set of input subsets to create an accurate and diverse
classifier ensemble.

1.  Using varying inputs, create and train the initial population of
    classifiers.

2.  Until a stopping criterion is reached:

    (a)  Use genetic operators to create new networks.
    (b)  Measure the diversity of each network with respect to the
         current population.
    (c)  Normalize the accuracy scores and the diversity scores of the
         individual networks.
    (d)  Calculate fitness of each population member.
    (e)  Prune the population to the N fittest networks.
    (f)  Adjust $\lambda$.
    (g)  The current population is the ensemble.

GEFS trains these new individuals using backpropagation. It adds new
networks to the population and then scores each population member with
respect to its prediction accuracy and diversity. GEFS normalizes these
scores, then defines the fitness of each population member ($i$) to be:

$Fitness_i = Accuracy_i + \lambda \, Diversity_i$  (5)

where $\lambda$ defines the tradeoff between accuracy and diversity.
Finally, GEFS prunes the population to the N most-fit members, then
repeats this process. At every point in time, the current ensemble
consists of simply averaging (with equal weight) the predictions of the
output of each member of the current population. Thus as the population
evolves, so does the ensemble.

We define accuracy to be network $i$'s training-set accuracy. (One may use
a validation set if there are enough training instances.) We define
diversity to be the average difference between the prediction of our
component classifier and the ensemble. We then separately normalize both
terms so that the values range from 0 to 1. Normalizing both terms allows
$\lambda$ to have the same meaning across domains.
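The scoring and pruning steps can be sketched as follows. The text states only that accuracy and diversity are normalized to the range 0 to 1; min-max scaling is assumed here as one way to do that. The names `fitness_scores`, `prune`, and `lam` (the accuracy/diversity trade-off weight), as well as the scores themselves, are hypothetical.

```python
def min_max(values):
    """Rescale values to [0, 1]; an assumed normalization scheme."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fitness_scores(accuracy, diversity, lam):
    """Fitness = normalized accuracy + lam * normalized diversity."""
    acc = min_max(accuracy)
    div = min_max(diversity)
    return [a + lam * d for a, d in zip(acc, div)]


def prune(population, scores, n_keep):
    """Keep the n_keep members with the highest fitness."""
    ranked = sorted(zip(scores, population),
                    key=lambda pair: pair[0], reverse=True)
    return [member for _, member in ranked[:n_keep]]


# Hypothetical raw scores for four networks.
population = ["net0", "net1", "net2", "net3"]
accuracy = [0.90, 0.80, 0.85, 0.70]
diversity = [0.10, 0.40, 0.30, 0.05]

survivors = prune(population, fitness_scores(accuracy, diversity, lam=1.0), 2)
accuracy_only = prune(population, fitness_scores(accuracy, diversity, lam=0.0), 2)
print(survivors, accuracy_only)
```

With the trade-off weight at 1.0 the most accurate network is displaced by two more diverse ones, whereas a weight of 0.0 ranks purely by accuracy.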

It is not always clear at what value one should set $\lambda$; therefore,
we automatically adjust $\lambda$ based on the discrete derivatives of the
ensemble error $E$, the average population error $\bar{E}$, and the
average diversity $\bar{D}$ within the ensemble. First, we never change
$\lambda$ if $E$ is decreasing; otherwise we (a) increase $\lambda$ if
$\bar{E}$ is not increasing and the population diversity $\bar{D}$ is
decreasing; or (b) decrease $\lambda$ if $\bar{E}$ is increasing and
$\bar{D}$ is not decreasing. We started $\lambda$ at 1.0 for the
experiments in this article. The amount $\lambda$ changes is 10% of its
current value.
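The adjustment rule can be written as a small function of three discrete derivatives (the change in each quantity since the previous generation). This sketch assumes that conditions (a) and (b) test the average population error, while the first test uses the ensemble error; that reading, the function name, and the argument names are assumptions made for illustration.

```python
def adjust_lambda(lam, d_ensemble_error, d_population_error, d_diversity,
                  step=0.10):
    """Return the updated trade-off weight.

    Each d_* argument is the change in that quantity since the previous
    generation (negative means it is decreasing).
    """
    if d_ensemble_error < 0:
        return lam                      # ensemble is improving: leave as is
    if d_population_error <= 0 and d_diversity < 0:
        return lam * (1.0 + step)       # (a) push harder for diversity
    if d_population_error > 0 and d_diversity >= 0:
        return lam * (1.0 - step)       # (b) push harder for accuracy
    return lam                          # no rule applies


lam = 1.0                               # starting value used in the text
unchanged = adjust_lambda(lam, -0.1, 0.2, -0.3)
increased = adjust_lambda(lam, 0.0, 0.0, -0.1)
decreased = adjust_lambda(lam, 0.1, 0.1, 0.0)
print(unchanged, increased, decreased)
```

The step of 10% of the current value makes the adjustment multiplicative, so the weight stays positive however many times it is decreased.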

We create the initial population by randomly choosing the number of
features to include in each feature subset. For classifier $i$, the size
of each feature subset ($N_i$) is independently chosen from a uniform
distribution between 1 and twice the number of original features in the
dataset. We then randomly pick, with replacement, $N_i$ features to
include in classifier $i$'s training set. Note that some features may be
picked multiple times while others may not be picked at all; replicating
inputs for a neural network may give the network a better chance to
utilize that feature during training. Also, replicating a feature in a
genome encoding allows that feature to better survive to future
generations.
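A minimal sketch of this initialization follows. Each genome is a list of feature indices drawn with replacement, with a length chosen uniformly between 1 and twice the number of features. The function name, the population size of 20, and the random seed are hypothetical; the 95 features match the case study described earlier.

```python
import random


def initial_population(n_features, population_size, rng):
    """Create random variable-length genomes of feature indices."""
    population = []
    for _ in range(population_size):
        size = rng.randint(1, 2 * n_features)        # inclusive bounds
        genome = [rng.randrange(n_features) for _ in range(size)]
        population.append(genome)                    # duplicates are allowed
    return population


rng = random.Random(1)
population = initial_population(n_features=95, population_size=20, rng=rng)
print([len(genome) for genome in population][:5])
```

Because a genome may be up to twice as long as the feature list, some members will see every feature at least once while others see only a handful, which is one source of the diversity the algorithm relies on.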

Our crossover operator uses dynamic-length, uniform crossover. In
this case, we choose the feature subsets of two individuals in the
current population proportional to fitness. Each feature in both
parents' subsets is independently considered and randomly placed in
the feature set of one of the two children. Thus it is possible to
have a feature set that is larger (or smaller) than the largest (or
smallest) of either parent's feature subset. Our mutation operator
works much like that of traditional genetic algorithms; we randomly
replace a small percentage of a parent's feature subset with new
features. With both operators, the network is trained from scratch
using the new feature subset; thus no internal structure of the
parents is saved during the crossover.
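The two operators might be sketched as below. The names are hypothetical, and two details are assumptions rather than statements from the text: each feature goes to either child with equal probability, and mutation replaces each feature independently with probability `rate`.

```python
import random


def select_parents(population, fitness, rng):
    """Pick two parents with probability proportional to fitness.

    Sampling is with replacement, so one individual may be drawn twice.
    """
    return rng.choices(population, weights=fitness, k=2)


def uniform_crossover(parent_a, parent_b, rng):
    """Send every feature of both parents to one of two children."""
    child_1, child_2 = [], []
    for feature in list(parent_a) + list(parent_b):
        target = child_1 if rng.random() < 0.5 else child_2
        target.append(feature)
    return child_1, child_2


def mutate(parent, n_features, rate, rng):
    """Replace each feature by a random feature with probability rate."""
    return [rng.randrange(n_features) if rng.random() < rate else f
            for f in parent]
```

Because the children split the pooled features of both parents, one child can end up longer than either parent and the other shorter (even empty, which a practical implementation would need to guard against).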

4  RESULTS 

We tested the utility of combining our approach for generating
numerous hierarchical theoretical descriptors of compounds with our
approach for filtering these descriptors with GEFS by modeling the
acute aquatic toxicity (LC50) of a congeneric set of 69 benzene
derivatives. The data were taken from the work of Hall, Kier and
Phipps 1984, where acute aquatic toxicity was measured in fathead
minnow (Pimephales promelas). Their data were compiled from eight
other sources, as well as some original work conducted at the U.S.
Environmental Protection Agency (USEPA) Environmental Research
Laboratory in Duluth, Minnesota. This set of chemicals was composed
of benzene and 68 substituted benzene derivatives.

Table 2 gives our results. We studied three approaches for modeling
toxicity: (1) giving all theoretical descriptors to a neural network,
(2) reducing the feature set in a traditional, previously published
(Gute & Basak 1997) manner, and (3) using our new genetic algorithm
technique on the entire feature set to create a neural network
ensemble. Results for our approaches are from leave-one-out
experiments (i.e., 69 training/test set partitions). Leave-one-out
works by leaving one data point out of the training set and giving
the remaining instances (68 in this case) to the learning algorithms
for training. (It is worth noting that each member of the ensemble
sees the same 68 training instances for each training/test set
partition and thus ensembles have no unfair advantage over other
learners.) This process is repeated 69 times so that each example is
a part of the test set once and only once. Leave-one-out tests the
generalization accuracy of a learner, whereas training set accuracy
tests only the learner's ability to memorize. Generalization error
from the test set is the true test of accuracy and is what we report
here.
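The leave-one-out protocol itself is independent of the learner. The sketch below uses hypothetical `fit` and `predict` callables and, purely as a stand-in for a trained network, a trivial model that predicts the mean of its training targets.

```python
def leave_one_out(xs, ys, fit, predict):
    """Return one held-out prediction per example.

    fit(train_x, train_y) -> model and predict(model, x) -> value are
    supplied by the caller; each example is the test set exactly once.
    """
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, xs[i]))
    return predictions


def fit_mean(train_x, train_y):
    """Stand-in learner: remember the mean training target."""
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    """Stand-in predictor: ignore the input, return the stored mean."""
    return model
```

For 69 compounds the loop would train 69 models on 68 compounds each; the reported statistics are then computed from the 69 held-out predictions.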

We first trained neural networks using all 95 parameters. The
networks contained 15 hidden units and we trained the networks for
1000 epochs. We normalized each input parameter to a value between 0
and 1 before training. Additional parameter settings for the neural
networks included a learning rate of 0.05, a momentum term of 0.1,
and weights initialized randomly between -0.25 and 0.25. With all 95
input parameters, the neural networks obtained a test-set correlation
coefficient between predicted toxicity and measured toxicity
(explained variance) of R² = 0.868 and a standard error of 0.29.
Target toxicity measurements ranged from 3.04 to 6.37.
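The text states only that each input was scaled to lie between 0 and 1. One common way to do this is min-max scaling per input column, sketched here as an assumption about the procedure rather than a description of it; the function name is hypothetical.

```python
def min_max_scale(column):
    """Linearly rescale a list of numbers onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```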

Our first method for feature-set reduction follows the work of Gute
and Basak 1997 on toxicity domains. Their method begins by using the
VARCLUS method of SAS 1998 to select subsets of topostructural and
topochemical parameters for QSAR model development. With this method,
the set of topological indices is first partitioned into two distinct
sets, the topostructural indices and the topochemical indices.


Table 2: Relative effectiveness of statistical and neural
network methods in estimating LC50 of 69 benzene
derivatives.

Method               R²       Standard Error
NN with 95 inputs    0.868    0.29
VARCLUS              0.825    0.32
NN with GEFS         0.893    0.27

To further reduce the number of independent variables for model
construction, the sets of topostructural and topochemical indices
were further divided into subsets, or clusters, based on the
correlation matrix using the VARCLUS procedure. This procedure
divides the set of indices into disjoint clusters, such that each
cluster is essentially unidimensional. From each cluster we selected
the index most correlated with the cluster, as well as any indices
which were poorly correlated with their cluster (R² < 0.70). The
variable clustering and selection of indices was performed
independently for both the topostructural and topochemical indices.
This procedure resulted in a set of five topostructural indices and a
set of nine topochemical indices. These indices were combined with
the three geometric and six quantum chemical parameters described
earlier. Their approach then applied linear regression to these 23
parameters. This study found that an accurate linear regression model
for acute aquatic toxicity required descriptors from all four levels
of the hierarchy: topostructural, topochemical, geometrical and
quantum chemical. This model utilized seven descriptors and obtained
an explained variance (R²) of 0.863 and a standard error of 0.30 on
the whole data set used as a training set. Our leave-one-out
experiment gave R² = 0.825 and a standard error of 0.32.

Finally we applied our genetic algorithm technique, GEFS, using all
95 parameters. The parameter settings for the networks in the
ensemble were the same as the settings for the single networks in the
first experiment. Parameter settings for the genetic algorithm
portion of GEFS included a mutation rate of 50%, a population size of
20, λ = 1.0, and a search length of 100 networks (20 networks for the
initial population and 80 networks created from crossover and
mutation). While the mutation rate may seem high as compared with
traditional genetic algorithms, certain aspects of our approach call
for a higher mutation rate (such as the criterion of generating a
population that cooperates as well as our emphasis on diversity);
other mutation values were tried during our pilot studies. With this
approach, we obtained a test-set correlation coefficient of R² =
0.893 and a standard error of 0.27; the initial population of 20
networks obtained a test-set R² = 0.835 and a standard error of 0.31.

5  DISCUSSION  AND  FUTURE  WORK

The correlation coefficient between the predicted value from the
computational model and the target value derived from the toxicity
test is an extremely informative metric of accuracy in this case. The
exact numeric value of most toxicity tests is not as important as the
relative ordering and spread of these values. Thus, a perfect
correlation (R² = 1.0) between the computational model and target
toxicity shows the computational model is as informative as the
toxicity obtained from a battery of expensive and time-consuming
tests, regardless of the standard error. Note that the standard error
of 0.27 is fairly good, given that the toxicity measurements ranged
from 3.04 to 6.37.

While the neural network technique and the standard data-reduction
technique obtained decent correlation with measured toxicity, our
ensemble technique was about 20% closer to perfect correlation. Note
that GEFS produces an accurate initial population and that running
GEFS longer with our genetic operators can further increase
performance. Thus our approach can be viewed as an "anytime" learning
algorithm. Such a learning algorithm should produce a good concept
quickly, then continue to search concept space, reporting the new
"best" concept whenever one is found (Opitz & Shavlik 1997). This is
important since, for most hazard assessment, an expert is willing to
wait for days, or even weeks, if a learning system can produce an
improved model for predicting toxicity.
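One way to read the "about 20% closer to perfect correlation" figure is as the fraction of the remaining gap to R² = 1 that the ensemble closes relative to the single network with all 95 inputs. The helper below is a hypothetical illustration of that reading, which the text does not spell out.

```python
def gap_closed(r2_baseline, r2_new):
    """Fraction of the distance from r2_baseline to 1.0 closed by r2_new."""
    return (r2_new - r2_baseline) / (1.0 - r2_baseline)


# Values taken from Table 2.
vs_single_network = gap_closed(0.868, 0.893)   # roughly 0.19
vs_varclus = gap_closed(0.825, 0.893)          # roughly 0.39
```

Under this reading the ensemble closes about 19% of the gap left by the single network and about 39% of the gap left by the VARCLUS-based regression.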

Our results demonstrate a very important point: that our method is
able to accurately predict toxicity directly from structure. Compared
to the actual battery of tests necessary to measure toxicity, a
computer model is much cheaper, much faster, and does not have a
negative impact on the environment. It is important to also note that
the computer model does not have to be the final measurement for
hazard assessment; additional tests can be run on compounds that are
either flagged by the model, or require more tests by the nature of
their use (such as a benzene derivative that may become a standard
fuel). Not only can good computer models become filters, they will
probably be the only viable option for processing all registered
chemicals.

While the method proposed here has proven effective, there is much
future work that needs to be completed. For instance, we plan to test
our method on other data sets of chemical derivatives; investigate
other ensemble feature selection techniques; investigate variants to
our genetic algorithm approach; and finally investigate the utility
of other descriptors, such as bio-descriptors.

6  CONCLUSIONS 

In this paper we presented a novel approach for creating a computer
model for hazard assessment. Our approach works by first extracting a
hierarchy of theoretical descriptors derived from the structure of a
compound, then filtering the numerous possible descriptors with a
genetic algorithm approach to ensemble feature selection. We tested
the utility of our approach by modeling the acute aquatic toxicity
(LC50) of a congeneric set of 69 benzene derivatives. Our results
demonstrate the ability of our approach to accurately predict
toxicity directly from structure. Thus our new algorithm further
increases the applicability of computer models to the problem of
predicting chemical activity directly from structure.

Appendix  1.4  Information  theoretic  indices  of  neighborhood
complexity  and  their  applications

Table  I  Correlation  coefficients  of  variables  with  the  principal  components  (only  the  10  most  highly  correlated  are  listed)

Figure  4  (Continued).

Table  II  Selected  topological  indices  for  38  isospectral  graphs  (Figure  4)

Table  IV  Parabolic  correlation  of  LD50  values  with  log  P  and  four  topological  indices


                   LD50 (Control) = A + BX + CX²                     LD50 (CCl4) = A + BX + CX²
Independent
Variable (X)     A        B        C       rᵃ     SD      F        A        B        C       rᵃ     SD      F

log P           62.20   -49.70    14.30   0.94   11.04   35.94    50.50   -34.00     7.34   0.94    6.70   34.82
TIC0           340.00   -26.40     0.54   0.96    9.13   54.87   216.00   -15.00     0.28   0.95    6.10   43.12
TIC1           288.00   -16.30     0.25   0.86   16.10   14.25   195.00    -9.85     0.13   0.97    4.68   76.61
CIC0           718.00  -457.00    74.80   0.91   12.99   24.57   407.00  -235.00    35.10   0.97    4.76   74.05
CIC1           620.00  -448.00    83.50   0.95    9.62   48.88   364.00  -239.00    40.70   0.96    5.54   53.27

ᵃ For each equation, r is the correlation coefficient, SD the standard deviation, and F the F-ratio between the variances
of observed and calculated values.
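Because every model in Table IV is a parabola in X, its turning point follows directly from the coefficients: X* = -B / (2C). The sketch below, with a hypothetical helper name, locates that point for the log P model of the control group; whether X* lies inside the range of the fitted data cannot be judged from the table alone.

```python
def parabola_vertex(a, b, c):
    """Return (x, y) at the turning point of y = a + b*x + c*x**2."""
    x = -b / (2.0 * c)
    return x, a + b * x + c * x * x


# log P row, LD50 (Control): A = 62.20, B = -49.70, C = 14.30
x_star, ld50_star = parabola_vertex(62.20, -49.70, 14.30)
```

Since C is positive for this row, the turning point is a minimum of the fitted LD50, at log P of about 1.74.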

Table  VIII  Lipophilicity,  anesthetic  dose  (AD50)  in  mice,  and  molecular  descriptors  for  barbiturates  (Figure  5)

structural  information  not  coded  by  other


Table  X  Lipophilicity,  log(1/C),  the  potency  for  inhibition  of  Arbacia  egg  cell  division,  and  molecular  descriptors  for
barbiturates  (Figure  5)

The normal boiling point for cyanogen is -22 °C; for its next homologue, malononitrile, it is 219 °C. The
difference, 241 °C, is apparently the highest one encountered for the addition of a single methylene group.
Problems connected with boiling points and a rationalization for this observation are discussed in the context
of intermolecular forces for liquids. A quantitative structure-property relationship (QSPR) study of the
normal boiling points for monohaloalkanes and for the corresponding nitriles is reported. The behavior of
the nitrile group as a pseudohalogen is also discussed. Normal boiling points of compounds having a cyano
group bonded to an electron-attracting substituent situate the CN group close to being a pseudohalogen, but
when the CN group is bonded to electron-donor substituents, the situation changes.


THE  LIQUID  STATE  AND  INTERMOLECULAR  FORCES

Intermolecular  forces  range  from  the  very  weak  ones  such 
as  those  existing  in  liquefied  noble  gases  to  the  strongest 
ones  (hydrogen  bonds)  existing  in  hydrogen  fluoride,  in 
dimers  of  carboxylic  acids  (even  in  vapor  state),  or  in  liquids 
with  multiple  hydroxy  groups  such  as  glycols  or  water.  The 
exceptional  features  of  water  (liquid  state  over  a  wide 
temperature  range,  expansion  on  freezing,  high  dielectric 
constant,  and  excellent  solvent  for  a  wide  variety  of 
substances)  are  responsible  for  making  life  possible  on  earth. 
Although  ionic  or  metallic  liquids  also  exist,  they  will  not 
be  discussed  here  because  they  are  not  molecular  liquids. 
One  should  mention  the  important  role  of  intermolecular 
forces  and  especially  of  hydrogen  bonding  in  all  life 
processes,  in  the  transcription/translation  processes  involving 
DNA,  in  protein  folding,  receptor-agonist  interactions,
enzymatic  mechanisms,  etc. 

Whereas  intermolecular  forces  in  crystals  are  compounded 
with  conformational  restrictions  due  to  packing  factors, 
liquids  have  molecular  and  conformational  mobility  (except 
for  liquid  crystals  within  certain  limits).  Liquids  are  more 
difficult  to  model  than  gases  or  solids.  However,  melting 
points  of  crystalline  solids  are  also  difficult  to  correlate  with 
chemical  structure  due  to  packing  factors,  except  for  some 
classes  of  congeneric  compounds. 

Intermolecular  forces  are  reflected  by  the  following:  vapor 
pressure  versus  temperature;  boiling  points  at  normal  pressure 
(normal  boiling  points,  NBPs);  critical  data;  latent  heat  of 
vaporization  versus  temperature;  viscosity;  density  and  molar 
volume;  optical  properties  such  as  the  refractive  index  and 
molecular  refractivity. 

Of all these clues, the easiest to measure with sufficient accuracy,
and the most often cited for any compound, is the boiling point;
usually, the NBP is cited, but seldom for compounds that would boil
at temperatures above 250 °C at normal pressure, because of
decomposition. Many iodine derivatives decompose on heating even at
lower temperatures because of the low C-I bond energy.

NITRILES  AND  THEIR  NORMAL  BOILING  POINTS 

The strongly electron-attracting nitrile (cyano) group is known to
cause high dipole moments. For example, in the gas phase the dipole
moments (in debye units) are as follows:1

X        for Me-X    for Ph-X
Cl       1.87 D      1.70 D
CF3      2.35 D      2.86 D
NO2      3.50 D      4.21 D
CN       3.94 D      4.39 D

The resulting dipole-dipole interactions lead to strong molecular
associations, manifested in higher NBPs, heats of vaporization, and
viscosities than those of the corresponding hydrocarbons with
comparable molecular weights.

Among thermodynamic properties, normal boiling points have been
extensively investigated in quantitative structure-property
relationships (QSPRs). Of the molecular descriptors used in such
correlations, topological indices have been among the most
successful.2-6 For alkanes, such QSPR studies nowadays allow the
prediction of NBPs within a range of 2 or 3 °C.7-9 For various other
classes of compounds many QSPR studies are available, and their
accuracy range is often lower than 10 °C.10-15

Nitriles, however, proved to defy simple approaches. Thus, a recent
study by Wessel and Jurs for a diverse set of industrially important
chemicals containing nitrogen, with root-mean-square errors of about
9 °C, led to satisfactory results for mononitriles but to very large
errors for two dinitriles, namely, cyanogen and malononitrile.12 We
therefore decided to look more closely into this matter. A
comprehensive review on malononitrile is available.16

Table 1. Cyano Group as a Pseudohalogen: NBPs for X-Y or X2 Compoundsᵃ

(pseudo)halogen              X-CN                   X-X
X         FW         NBP (°C)     FW        NBP (°C)     FW

F          19          -72         49         -188        38
CN         26          -22         60          -22        60
Cl         35           13         66          -35        71
Br         80           62        110           56       160
I         127          184        157          178       254

ᵃ Figures have been rounded off to the nearest integer.


Table 2. NBPs of Cyanotrihalomethanes Hal3C-CN and of
Tetrahalomethanes Hal3C-X (Hal = F, Cl, Br)ᵃ

              Hal = F              Hal = Cl             Hal = Br
X        NBP (°C)    FW       NBP (°C)    FW       NBP (°C)    FW

F         -128        88         25       137        107       271
Cl         -82       104         77       154        160       287
CN         -62        95         84       149        170       278
Br         -79       149        104       198        190       332
I          -23       196        141       245

ᵃ Figures have been rounded off to the nearest integer.

CYANO  GROUP  AS  A  PSEUDOHALOGEN 

Groups  such  as  cyano,  thiocyano,  cyanato,  and  azido  are 
considered  to  be  pseudohalogens.17-19  In  this  paper  we  shall 
focus  only  on  the  cyano  group.  There  are  also  significant 
differences,  however,  between  some  compounds  of  halogens 
and  pseudohalogens,  for  instance  the  fact  that  hydrogen 
cyanide  is  a  much  weaker  acid  (with  pKa  =  9.2)  than
hydrogen  halides.  Also,  the  coordinating  ability  of  the 
cyanide  anion  for  iron  leads  to  a  high  toxicity,  whereas  each 
of  the  halide  anions  has  a  different  biological  significance. 
One  should  also  recall  that  the  cyano  group  is  bidentate, 
being  able  to  form  covalent  or  coordinate  bonds  at  the 
carbon  or  nitrogen  atoms.  Thus,  the  elongated  shape  of  the 
cyano  group  makes  it  different  from  the  spherical  halogens. 

It  is  known  that  molecular  weights  have  a  large  influence 
on  NBPs.  According  to  its  formula  weight  (FW),  a  CN  group 
is  intermediate  between  a  fluorine  and  a  chlorine  atom.  On 
comparing  NBPs20-22  of  simple  halogens,  interhalogens, 
cyanogen,  or  cyanogen  halide  linear  molecules  (Table  1),  it 
can  be  seen  that  the  cyano  group  does  indeed  behave  as  a 
pseudohalogen.  On  considering  cyanogen  halides,  the  CN 
group  is  placed  by  NBPs  between  fluorine  and  chlorine. 
However,  on  comparing  NBPs  of  cyanogen  and  those  of 
elemental  halogens,  the  CN  group  is  situated  between 
chlorine  and  bromine,  as  if  the  CN  group  had  a  slightly 
higher  formula  weight. 

In Table 2 the NBPs of cyanotrihalomethanes, X3C-CN, and of
tetrahalomethanes, CX4, are shown. It can be seen that the cyano
group behaves again as a pseudohalogen situated between chlorine and
bromine.

Although some physical data support the idea that the CN group
manifests itself as a pseudohalogen, its chemical behavior in organic
compounds is quite different from that of halogens. The C-Cl, C-Br,
and C-I bond strengths are much lower than the bond strength of the
C-CN bond; therefore, these halogens (unlike CN groups) are good
leaving groups. In the next section we shall examine organic
compounds whose NBPs are much higher than those of the corresponding
halogen compounds, so that the cyano group would be situated beyond
iodine; in such cases, the notion of pseudohalogen is no longer
justified.

NORMAL  BOILING  POINTS  OF  NITRILES  AND  DINITRILES

Mononitriles have NBPs which are quite high when compared with the
corresponding halides (Table 3). In Table 3 structures of halogen
derivatives are indicated (in abbreviated form) according to IUPAC
nomenclature rules; for nitriles, however, to achieve consistency,
the CN group is considered as a pseudohalogen; therefore, the
nomenclature is no longer according to IUPAC. In these cases a CN
group increases the NBP much more than the heaviest stable halogen
atom, namely, iodine. An analogous behavior is apparent when
comparing halocarbonyl or cyanocarbonyl compounds (Table 4). Also,
the NBPs of 1,ω-alkanedihalides for linear alkane chains with one
through four carbon atoms, X(CH2)nX (with n = 1-4), are much lower
than for the corresponding 1,ω-alkanedinitriles (Table 5).

As seen from Table 6 for gem-dihalides or gem-dinitriles of methane,
ethane, or propane, a similar trend with higher NBPs for X = CN than
for X = Hal is observed; moreover, one sees the curious trend that
when the X group in R-X is I or CN, the NBPs decrease progressively
in the above series with increasing molecular weight, whereas the
corresponding compounds with X = F, Cl, or Br exhibit the reverse,
normal behavior. A break in Table 6 separates the compounds with
normal and abnormal behavior.
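The trend can be checked mechanically against the Table 6 values. The snippet below is only an illustration; the dictionary layout and function name are hypothetical, and the numbers are the rounded NBPs (°C) for CH2X2, MeCHX2, and Me2CX2 as read from Table 6.

```python
# NBPs (°C) in the order CH2X2, MeCHX2, Me2CX2, read from Table 6.
GEM_NBP = {
    "F": (-52, -25, 0),
    "Cl": (40, 58, 71),
    "Br": (97, 113, 115),
    "I": (181, 178, 148),
    "CN": (219, 198, 170),
}


def trend(values):
    """Classify a sequence as strictly increasing, decreasing, or mixed."""
    pairs = list(zip(values, values[1:]))
    if all(b > a for a, b in pairs):
        return "increasing"
    if all(b < a for a, b in pairs):
        return "decreasing"
    return "mixed"


trends = {x: trend(v) for x, v in GEM_NBP.items()}
```

F, Cl, and Br come out as increasing along the series, while I and CN come out as decreasing, matching the normal and abnormal groups described above.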

QSPR  STUDY  OF  MONOHALO  DERIVATIVES  AND  OF  THEIR  CYANO  ANALOGUES

For correlating the chemical structure with the NBP for the data
presented in Table 3 we selected eleven topological indices: the
information indices IC1-IC3 and CIC1-CIC3;23 the Wiener index W; the
valence connectivity indices ⁰χᵛ-²χᵛ;4,24 and the average
distance-sum connectivity adapted for heteroatoms based on their
electronegativities (Balaban's index, Jx).25,26 All indices except
the last one were computed using the program POLLY.27

Because the scale of the various topological indices may differ by
several orders of magnitude, all indices were transformed by first
adding 1 to the index and then taking the natural logarithm of this
result. The transformed version of the indices was used in all
analyses. The CORR procedure of the SAS statistical package32 was
used to identify intercorrelated indices. The elimination of such
indices reduced to four the number of selected TIs, namely, IC2,
CIC2, ¹χᵛ, and Jx.
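The two preprocessing steps can be sketched in plain Python. This is an illustration only: the analysis described here used the SAS CORR procedure, the function names are hypothetical, and the correlation threshold is an assumed parameter, since no cutoff is stated in the text.

```python
import math


def log_transform(values):
    """Apply y -> ln(1 + y) to every index value."""
    return [math.log1p(v) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_intercorrelated(columns, threshold):
    """Keep each index only if |r| with every kept index is below threshold.

    columns maps an index name to its list of (transformed) values;
    earlier entries take precedence over later ones.
    """
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

The greedy, order-dependent filter is a design simplification; which member of a correlated pair is retained was a judgment made by the authors and is not reproduced by this sketch.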

An all-subset regression was accomplished using the REG procedure of
the same statistical package,32 which indicated that ¹χᵛ and Jx gave
the best results; IC2 and CIC2 gave the next best results. The
drawback of IC and CIC indices is that the nature of the halogen does
not affect the value of these indices.

Experimental  and  calculated  data  for  NBPs  of  monohalo 
derivatives  with  one  through  five  carbon  atoms  and  the 
corresponding  mononitriles  with  two  to  six  carbon  atoms 
are  presented  in  Table  3,  above  the  solid  line.  Some  nitriles 
with  six  to  eight  carbon  atoms  are  also  included  below  the 
solid  line,  but  they  have  no  halogen  counterparts,  and  the 
correlations  discussed  below  do  not  include  them. 

Table 3. NBPs (°C) of Organic Halides and Nitriles R-X and
QSAR in Terms of ¹χᵛ and Jx

compd            NBPexpᵃ   NBPcalcᵃ   diff(exp-calc)ᵃ   ¹χᵛ ᵇ     Jx ᵇ

Me-F               -78       -81            3           0.3206    0.6054
Et-F               -38       -32           -6           0.6801    0.9143
Pr-F                 3         3            0           0.9058    1.0550
Bu-F                33        34           -1           1.0899    1.1346
sBu-F               25        22            3           1.0685    1.2334
1-F-C5              63        62            1           1.2453    1.1860
Me-Cl              -24       -23           -1           0.7580    0.6152
Et-Cl               12        10            2           0.9199    0.9207
Pr-Cl               47        47            0           1.1016    1.0588
iPr-Cl              36        34            2           1.0328    1.1656
Bu-Cl               79        79            0           1.2553    1.1375
sBu-Cl              68        69           -1           1.2081    1.2369
iBu-Cl              69        70           -1           1.2134    1.2407
tBu-Cl              51        51            0           1.1207    1.3635
1-Cl-C5            108       106            2           1.3885    1.1881
2-Cl-C5             97        97            0           1.3473    1.2672
2-Me-1-Cl-C4       100       100            0           1.3617    1.3043
3-Me-1-Cl-C4        99        98            1           1.3520    1.2709
CEt2-Cl             98        99           -1           1.3571    1.2999
Me-Br                4        10           -6           1.0865    0.6403
Et-Br               39        33            6           1.1301    0.9357
Pr-Br               71        69            2           1.2798    1.0685
iPr-Br              60        56            4           1.1906    1.1768
Bu-Br              102        99            3           1.4100    1.1445
sBu-Br              91        90            1           1.3421    1.2453
iBu-Br              91        97           -6           1.3742    1.2479
tBu-Br              73        77           -4           1.2476    1.3727
1-Br-C5            130       125            5           1.5252    1.1936
2-Br-C5            117       116            1           1.4649    1.2737
2-Me-1-Br-C4       121       125           -4           1.5019    1.3100
3-Me-1-Br-C4       120       122           -2           1.4934    1.2765
CEt2-Br            119       119            0           1.4736    1.3070
Me-I                43        51           -8           1.2627    0.6689
Et-I                73        66            7           1.2528    0.9532
Pr-I               103       100            3           1.3863    1.0801
iPr-I               90        87            3           1.2862    1.1900
Bu-I               131       127            4           1.5041    1.1531
sBu-I              120       117            3           1.4248    1.2553
iBu-I              121       127           -6           1.4716    1.2568
tBu-I              100       106           -6           1.3265    1.3833
1-I-C5             155       150            5           1.6094    1.2000
2-I-C5             141       141            0           1.5384    1.2818
2-Me-1-I-C4        148       153           -5           1.5880    1.3169
3-Me-1-I-C4        147       149           -2           1.5802    1.2829
CEt2-I             146       145            1           1.5465    1.3156
Me-CN               82        71           11           0.5446    1.2196
Et-CN               97       104           -7           0.8259    1.2173
Pr-CN              118       126           -8           1.0239    1.2366
iPr-CN             104       109           -5           0.9810    1.3592
Bu-CN              141       143           -2           1.1891    1.2565
sBu-CN             125       128           -3           1.1647    1.3880
iBu-CN             131       129            2           1.1442    1.3483
tBu-CN             106       108           -2           1.0899    1.5065
1-CN-C5            164       158            6           1.3308    1.2737
2-CN-C5            146       145            1           1.3097    1.3888
2-Me-1-CN-C4       154       147            7           1.2920    1.3431
3-Me-1-CN-C4       157       158           -1           1.3308    1.2737
CEt2-CN            146       142            4           1.3199    1.4339
------------------------------------------------------------------------
EtCMe2-CN          129       126            3           1.2624    1.5304
1-CN-C6            183       171           12           1.4549    1.2881
2-CN-C6            164       160            4           1.4363    1.3840
3-Me-1-CN-C5       172       158           14           1.4298    1.3992
4-Me-1-CN-C5       180       159           21           1.4298    1.3830
5-Me-1-CN-C5       180       158           22           1.3308    1.2737
1-CN-C7            199       183           16           1.5653    1.3002

ᵃ Figures have been rounded off to the nearest integer. ᵇ Topological
indices ¹χᵛ and Jx are expressed by converting their values (y) into
ln(1 + y).

Table 4. NBPs of Halocarbonyl Derivatives (Iodine Derivatives
Are Not Available)ᵃ

            NBP (°C)
X        EtOCOX    ClCOX

F           57      -45
Cl          95        8
Br         116       25
CN         116      128

ᵃ Figures have been rounded off to the nearest integer.

Table 5. NBPs of 1,ω-Dihalides and 1,ω-Biscyanides of Linear
Alkanes C1-C4ᵃ

                          NBP (°C)
X        XCH2X    X(CH2)2X    X(CH2)3X    X(CH2)4X

F         -52        31          42          78
Cl         40        84         121         154
Br         97       131         167         197
I         181       200         227
CN        219       266         286         295

ᵃ Figures have been rounded off to the nearest integer.


Table 6. NBPs of gem-Bis(pseudo)halides of Alkanes C1-C3ᵃ

                  NBP (°C)
X        CH2X2    MeCHX2    Me2CX2

F         -52      -25         0
Cl         40       58        71
Br         97      113       115

I         181      178       148
CN        219      198       170

ᵃ Figures have been rounded off to the nearest integer.


A comment on how the ¹χᵛ and Jx indices vary with increasing size and
branching of molecules needs to be added. Both these indices increase
with increasing size. The nature of the halogen X in R-X molecules
with the same R group also leads to a progressive increase in the
series F, Cl, Br, and I; this increase is steep for ¹χᵛ but moderate
for Jx. However, increasing branching of the R group for isomeric
molecules leads to decreasing values for ¹χᵛ but to increasing values
for Jx. Of course, as a general rule, experimental NBPs increase with
increasing size and molecular weight of molecules and decrease with
molecular branching; only poly(fluoroalkanes) are exceptions to this
rule, as mentioned earlier.13

The corresponding equations are shown in Table 7a,b with the
statistical parameters. For the chloro derivatives Jx was not a
significant parameter, so that a monoparametric equation in terms of
¹χᵛ gave satisfactory results in this case. For all other compounds
from Table 3, such monoparametric equations led to worse results than
those presented in both parts a and b of Table 7. Intercorrelation
factors between the four selected indices are presented in Table 8;
one can see that no significant intercorrelation is present. It can
be observed from Tables 3 and 7a that the correlation for nitriles is
slightly poorer than for the halogens; however, the agreement between
the experimental and calculated NBPs is quite good. Remarkably, the
coefficients of the ¹χᵛ parameter are similar for Br and I in Table
7a and for all halogens in Table 7b; this fact is reminiscent of the
observation presented in the earlier paper13 about the fact

Table 7. Correlation Equations for NBP and Statistical Parameters

(a) In Terms of ¹χᵛ and Jx

        NBP                                                  s       r        F

RF      (208 ± 23) ¹χᵛ - (84.0 ± 32) Jx - (96.8 ± 15)       4.3    0.998     355
RCl     (204 ± 2) ¹χᵛ - (177 ± 2)                           1.4    0.999    9444
RBr     (203 ± 12) ¹χᵛ + (45.6 ± 9.3) Jx - (239 ± 12)       4.5    0.994     404
RI      (195 ± 15) ¹χᵛ + (59.3 ± 10) Jx - (235 ± 17)        5.3    0.989     235
RCN     (117 ± 8.4) ¹χᵛ - (92.5 ± 22) Jx + (120 ± 27)       6.1    0.976      99

(b) In Terms of IC2 and CIC2

        NBP                                                  s       r        F

RF      (223 ± 27) IC2 + (197 ± 47) CIC2 - (406 ± 38)      10.2    0.988      62
RCl     (230 ± 16) IC2 + (146 ± 16) CIC2 - (333 ± 27)       9.0    0.978     109
RBr     (218 ± 16) IC2 + (135 ± 16) CIC2 - (286 ± 27)       8.7    0.977     104
RI      (198 ± 15) IC2 + (115 ± 15) CIC2 - (217 ± 25)       8.4    0.974      92
RCN     (176 ± 30) IC2 + (93.9 ± 20) CIC2 - (168 ± 42)     11.4    0.914      25

Table 8. Intercorrelation Matrix for the Four Selected TIsᵃ

        ¹χᵛ     Jx      IC2     CIC2
¹χᵛ     1.000   0.702   0.802   0.178
Jx              1.000   0.451   0.655
IC2                     1.000   -0.331
CIC2                            1.000

ᵃ Topological indices ¹χᵛ and Jx are shown by converting their values


[Figure 1: scatter plot; axis label: Experimental NBP; legend: R-F, R-Cl, R-Br, R-I]

Figure 1. Plot of the predicted NBP versus the experimental NBP
for the combined set of 45 monohaloalkanes from Table 3 in terms
of two TIs, namely, ¹χᵛ and Jx.

that  one  might  consider  a  “generalized  halogen”  with  a 
stepwise  increment  for  the  four  halogens  F,  Cl,  Br,  and  I. 
Though  the  aim  of  the  present  paper  was  to  discuss  nitriles 
and  not  haloderivatives  (the  NBPs  of  these  last  compounds 
were  the  object  of  a  QSPR  study  in  the  earlier  paper13),  one 
can  use  the  same  parameters  as  in  Table  7a  for  a  correlation 
of  NBPs  for  all  45  halogen  derivatives  presented  in  Table  3 
according  to  the  following  equation: 

$$\mathrm{NBP} = (180 \pm 7.8)\,^{1}\chi^{v} + (34 \pm 10)\,J_x - (189 \pm 9.2)$$
s = 10 °C    r = 0.9823    F = 579

The  diagram  shown  for  this  correlation  in  Figure  1 
indicates  that  only  2-butyl  fluoride  and  three  halomethanes 
with  F,  Br,  and  I  have  deviations  above  14  °C  between 
observed  and  predicted  NBPs. 

Interestingly,  the  last  equation  of  Table  7a  works  even 
for  other  aliphatic  mononitriles  with  six  to  eight  carbon 
atoms,  presented  at  the  bottom  of  Table  3  below  the  full 
Table 9. NBPs (°C) of Unsaturated Nitriles and QSAR in Terms of
IC2 and CIC2

nitrile         IC2      CIC2     NBPexp   NBPcalc   diff(exptl - calcd)
C=CC#N          1.2590   0.2515   78       80        -2
C#CC#N          1.2006   0.0000   43       40        3
C=CCC#N         1.3666   0.3365   119      112       7
CC=CC#N         1.2936   0.5158   113      116       -3
CC(=C)C#N       1.2936   0.5158   91       116       -25
CC=C(C)C#N      1.2548   0.7853   138      137       1
CC(C)=CC#N      1.2102   0.8531   141      135       6
C=CCCC#N        1.3689   0.5704   140      138       2
CCC=CC#N        1.3930   0.5146   136      137       -1
C=CC=CC#N       1.3468   0.4787   137      123       14
CCC(C)=CC#N     1.3625   0.7391   142      155       -13
CC(C)C=CC#N     1.3300   0.7971   155      155       0
CC(C)=CCC#N     1.3300   0.7971   166      155       11

line;  however,  in  these  cases,  all  calculated  values  are  lower 
than  the  experimental  ones. 

Unsaturation in the nitrile chain appreciably lowers the
NBP, as seen in Table 9. Using the same descriptors as in
Table 7b for these nitriles with three to six carbon atoms
having one or two double bonds or one triple bond (denoted
by # in Table 9, which uses SMILES notation for structures),
the QSAR results presented in Table 9 were obtained with
the following equation:

$$\mathrm{NBP} = (214 \pm 52)\,\mathrm{IC}_2 + (109 \pm 13)\,\mathrm{CIC}_2 - (217 \pm 67)$$
s = 11 °C    r = 0.9121    F = 52
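
As a quick arithmetic check, the equation above can be evaluated for a few of the nitriles in Table 9. The sketch below is only illustrative: the function name `predicted_nbp` is hypothetical, only the central coefficient values (214, 109, 217) are used, and the quoted uncertainties are ignored.

```python
def predicted_nbp(ic2, cic2):
    """Evaluate NBP (deg C) from the central coefficients of the Table 9 equation."""
    return 214.0 * ic2 + 109.0 * cic2 - 217.0


# (SMILES, IC2, CIC2, experimental NBP in deg C): first four rows of Table 9
rows = [
    ("C=CC#N", 1.2590, 0.2515, 78),
    ("C#CC#N", 1.2006, 0.0000, 43),
    ("C=CCC#N", 1.3666, 0.3365, 119),
    ("CC=CC#N", 1.2936, 0.5158, 113),
]

for smiles, ic2, cic2, nbp_exp in rows:
    nbp_calc = predicted_nbp(ic2, cic2)
    # Report the calculated value and the residual (experimental minus calculated)
    print(f"{smiles:10s} calc = {nbp_calc:6.1f}   exp - calc = {nbp_exp - nbp_calc:+6.1f}")
```

Rounded to the nearest degree, the calculated values for these four rows agree with the NBPcalc column of Table 9.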

A  GUESSING  GAME 

On addressing an audience of chemists, the following
guessing game was proposed: the audience was given the
NBPs of the 1,ω-alkanedinitriles X(CH2)nX with n = 1-4,
namely, malononitrile, succinonitrile, adiponitrile, and
caprononitrile (i.e., the last line in Table 5). Then everyone
was asked to guess the NBP temperature interval for
oxalonitrile (the compound with n = 0) by putting a mark
in one of the following eight intervals: <-20; -20 to +20;
+20 to +60; +60 to +100; +100 to +140; +140 to +180;
+180 to +220; and >+220 °C. Remarkably, no member of
the audience guessed that oxalonitrile (cyanogen, with NBP
= -22 °C) should appear in the first temperature interval
(NBP < -20 °C). The other seven temperature intervals were
about equally populated with marks.

LARGEST  INCREMENT  IN  NBP  FOR  A  HOMOLOGOUS 
SERIES 

The two compounds (cyanogen and malononitrile) mentioned
as outliers in the QSPR study cited earlier12
represent the pair with the largest NBP increment on adding
one methylene group, as seen from Table 10. In this table,
one compares pairs of successive homologues having various
simple groups bonded either directly (R2) or via a methylene
group (RCH2R), where R can be a halogen, a cyano group,
an alkyl, an alkoxy, or an organic electronegative group.
Breaks in the table delineate various related classes of
compounds.

The  first  entry  of  the  above  two  compounds  constitutes  a 
class  by  itself.  The  huge  difference  of  241  °C  between  the 
NBPs  of  cyanogen  (oxalonitrile,  with  NBP  =  -22  °C)  and 
malononitrile  (with  NBP  =  219  °C)  can  be  explained  by 
the  fact  that  cyanogen  has  a  linear  geometry  and  hence  a 
Table 10. Differences in NBPs for Compounds Differing by One
Methylene Groupᵃ

            NBP (°C)
R           R2      RCH2R   diff
CN          -22     219     241

H           -253    -162    91
CF3         -78     1       79
CH3-CO      88      138     50
CH3         -89     -42     47
HC≡C        10      55      45

F           -188    -52     136
Cl          -35     40      75
Br          56      97      41
I           184     182     -2

Et          0       37      37
MeO         14      42      28
EtO         63      88      25
MeS         110     149     39
EtS         154     181     27
CCl3        186     206     20
COOMe       163     181     18
COOEt       185     199     14
COOPr       211     229     18
COOBu       242     256     14
Ph          256     264     12

ᵃ Figures have been rounded off to the nearest integer.

zero dipole moment, whereas malononitrile is a V-shaped
molecule with a high dipole moment, 3.58 D.28,29 The
calculated polarizability of malononitrile is abnormally high
in comparison with calculated values.30,31

A few other comments on Table 10 should be added. The
first nine entries show differences in NBPs that are higher
than 40 °C for the two homologues. Among these, the first
six have electronegative or slightly electron-donating groups;
the next class includes the four stable halogens, and the trend
in this group with progressively decreasing electronegativity
is quite interesting, starting with the next highest NBP
difference in the whole table (for fluorine) and ending with
a negative difference (for iodine). All these entries have linear
R2 and bent R2CH2 molecules for the two homologues,
respectively.

The  last  class  with  NBP  differences  lower  than  40  °C, 
however,  demonstrates  that  electronegativity  by  itself  does 
not  provide  a  full  explanation  for  the  data  contained  in  Table 
10.  Indeed,  here  again  we  encounter  groups  with  electron- 
donating  as  well  as  with  electron-accepting  properties. 
However, in this class the R2 molecules no longer have linear
geometries except for biphenyl and hexachloroethane.

OTHER  DINITRILES 

A comparison between volatilities of dinitriles of four-carbon
dicarboxylic acids is interesting, despite the incompletely
matched data. Succinonitrile has an NBP of 266 °C
and a dipole moment of 3.93 D. Of the two stereoisomeric
olefinic congeners, the dinitrile of fumaric acid with
E-geometry is more volatile (NBP of 186 °C, subliming even
under 100 °C) than the dinitrile of maleic acid (with a higher
dipole moment because of its Z-geometry), which has a BP
of 111 °C at 20 Torr and 99 °C at 13 Torr. The alkynic
congener, which has a linear geometry and zero dipole
moment (dicyanoacetylene or acetylenedicarbonitrile, C4N2),
has an NBP of only 77 °C and sublimes easily. Interestingly,
the dinitrile C6N2 of hexadiynedioic acid with two triple
bonds (with linear geometry) has an NBP of only 154 °C.

Isomers of benzodinitrile also have volatilities that attest
to the importance of dipole moments: phthalonitrile, with the
highest dipole moment, has a boiling point of 151 °C at
10 Torr; isophthalonitrile, with a dipole moment about
half as large, has a BP of 140 °C at the same reduced
pressure; and terephthalonitrile, with a zero dipole moment,
sublimes at normal pressure at temperatures starting at
153 °C.

When  the  CN  group  is  attached  to  an  electron-acceptor 
substituent,  the  polarity  of  the  bond  is  low  and  the  NBP  is 
within  the  range  expected  for  a  pseudohalogen  with  a  formula 
weight  close  to  that  of  chlorine.  However,  when  the  CN 
group  is  bonded  to  an  electron-donor  substituent,  the  high 
polarity  of  the  resulting  bond  enhances  appreciably  the  NBP. 
The  conclusion  is  that  NBPs  are  the  result  of  a  multiplicity 
of  factors  inherent  in  determining  the  intermolecular  forces 
that  exist  in  the  liquid  state.  In  certain  cases  such  as  the  two 
homologous  dinitriles  with  two  and  three  carbon  atoms, 
QSPR  studies  should  not  ignore  differences  between  these 
intermolecular  interactions. 
Abstract

A novel QSAR study of benzamidine complement-inhibitory activity and benzene derivative acute toxicity is
reported, and a new efficient method for selecting descriptors is used. Complement-inhibitory activity QSAR models
of benzamidines contain from one to five descriptors. The best, according to fitted and cross-validated statistical
parameters, is shown to be the five-descriptor model. Models with a higher number of indices did not improve over
the five-descriptor model. The benzene derivative structure-toxicity models involve up to seven linear descriptors.
Multiregression models, containing up to ten nonlinear descriptors, are also reported for the sake of comparison with
previously obtained additivity models. Comparison with benzamidine complement-inhibitory activity models and
with benzene derivative toxicity models from the literature favors our novel approach. © 2000 Elsevier Science
Ireland Ltd. All rights reserved.
0097-8485/00/$ - see front matter © 2000 Elsevier Science Ireland Ltd. All rights reserved.
PII: S0097-8485(99)00059-5


1. Introduction

In our recent papers a hierarchical QSAR (quantitative
structure-activity relationship) approach was used
to model the complement-inhibitory activity of benzamidines
(Basak et al., 1999a) and the acute aquatic
toxicities of benzene derivatives (Gute and Basak, 1997;
Basak et al., 1999c). The hierarchical QSAR approach
uses topological (partitioned into topostructural and
topochemical), geometric and quantum-chemical descriptors
in a stepwise fashion to build increasingly
more complex structure-property-activity models
(Basak et al., 1997, 1999b). Now we report the use,
with the same aim, of a new efficient approach for
selecting the best QSAR models using multivariate
regression (MR) (Lucic and Trinajstic, 1999; Lucic et
al., 1999a) and a standard approach for variable selection
and model generation used in CODESSA (Katritzky
et al., 1999; Lucic et al., 1999b). Some time ago
Hansch and Yoshimoto (Hansch and Yoshimoto, 1974)
carried out a QSAR study on the complement-inhibitory
potency of benzamidines using their own approach.
Ten years later, Hall et al. (Hall et al., 1984)
carried out a QSAR study on the toxicities of benzene
derivatives using de novo analysis (Free and Wilson,
1964; Kubinyi and Kehrhahn, 1976), and derived an
additivity model for 66 compounds (they excluded three
compounds as outliers). We will analyze their models
and compare them with ours.
Benzamidines are inhibitors of the complement system.
Complement is a system of factors occurring in
normal serum which are characteristically activated by
antibody-antigen interactions and which subsequently
mediate a number of biologically significant consequences.
The factors of the complement system include
at least 20 chemically distinct serum proteins and
glycoproteins. These factors, which normally exist in an
inactive form, may be activated by two (classical and
alternative) pathways. Both pathways generate macromolecular
membrane attack complexes which lyse a
variety of cells, bacteria and viruses (Kuby, 1992).
Products of this activation result in inflammatory reactions
at the site of antibody-antigen interaction. This is
especially pronounced in the case of organ-specific and
systemic autoimmune disorders. Therefore, control of
unregulated complement activation is essential, especially
in the case of autoimmune disease.

Acute aquatic toxicities of benzene derivatives in the
fathead minnow (Pimephales promelas) indicate 96-h
values ranging from 3.0 to 6.4 log units for the LC50
(lethal concentration to 50% of the sample). Details about LC50
measurements are given in the report by Hall et al.
(Hall et al., 1984).


2.  Data  sets 

2.1. Benzamidines

In Fig. 1 we give the structural formula of benzamidines
and in Table 1 the side-chain structures and
experimental complement-inhibitory activities in terms
of 1/log C for the studied benzamidines. C in log C is the
micromolar concentration of inhibitor required for 50%
inhibition of lyophilized guinea pig complement when
assayed in buffer (Hansch and Yoshimoto, 1974).

Fig. 1. Structural formula of benzamidines.

2.2. Benzene derivatives

Toxicity data for 69 benzene derivatives are taken
from Hall et al. (Hall et al., 1984). The toxicity data
reported by Hall et al. consist of 26 original experimental
observations and 43 taken from seven different
sources. Thus, the studied set of benzene derivatives
contains toxicities of 68 compounds and benzene. The
benzene derivatives in this set have seven different
substituents, each substituent being present in at least
six compounds. These substituents are amino, bromo,
chloro, hydroxyl, methyl, methoxyl and nitro groups.
The studied benzene derivatives are listed in Table 2. Their
toxicities are expressed as the negative logarithm of the
lethal concentration of a benzene derivative and denoted
by -log(LC50).

2.3.  Molecular  descriptors 

Table 3 gives the symbols and brief descriptions of the
descriptors that are used for the QSAR modeling of
benzamidines and benzene derivatives in the present
work. The total number of descriptors is 110 (40 topostructural,
61 topochemical, three geometric and six
quantum-chemical descriptors). In the previous QSAR
study of benzamidines (Basak et al., 1999a) 95 descriptors
were used (37 topostructural, 55 topochemical and
three geometric). The difference arises because
nine topological descriptors possess zero values for all
molecules studied (we included them in our set simply
to have the complete set of descriptors) and six
quantum-chemical descriptors were not included in the
previous modeling. All topological descriptors were
transformed, as was done by Basak et al. (Basak et al.,
1999a), using a natural logarithmic transformation of
the form ln(x + 1), where x represents single values of
descriptors. This was done to avoid errors in rounding
up numerical values because the range of descriptor
values was rather large. The geometric descriptors were
transformed by the natural logarithm of the descriptor
for consistency.
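
The two transformations described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the sample values are hypothetical, and `math.log1p` is used simply because it evaluates ln(x + 1) directly.

```python
import math


def transform_topological(values):
    """Apply ln(x + 1) to each topological descriptor value."""
    return [math.log1p(x) for x in values]


def transform_geometric(values):
    """Apply the plain natural logarithm to each geometric descriptor value."""
    return [math.log(x) for x in values]


# Hypothetical descriptor values spanning several orders of magnitude
topological = [0.0, 3.5, 120.0, 4800.0]
geometric = [12.0, 250.0, 3100.0]

print(transform_topological(topological))
print(transform_geometric(geometric))
```

One practical consequence of the ln(x + 1) form is that a descriptor value of zero maps to zero rather than to an undefined logarithm.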

In  the  case  of  benzene  derivatives  we  used  the  same 
set  of  descriptors  as  Gute  and  Basak  (Gute  and  Basak, 
1997)  and  Basak  et  al.  (Basak  et  al.,  1999c).  They  were 
transformed  in  the  same  way  as  the  benzamidine  data 
set  (see  Basak  et  al.,  1999a). 

2.4.  Variable  selection  and  models  generation 

To obtain the best possible QSAR models with I
(I = 1, 2, 3, ...) descriptors we used a computational
approach, detailed elsewhere (Lucic and Trinajstic,
1999), by which one can select the best MR model with
I descriptors from the set of N descriptors. The number
of possible models with I descriptors is N!/[(N - I)! I!].
The quality of each model (with I descriptors) was
identified with its correlation coefficient (R), and
among all possible models the one with the highest
value of R was selected. To be able to check the
quality of a large number of MR models, it was necessary
to develop a very fast procedure for calculating R,
Table 1

Observed and calculated (cross-validated, CV, and fitted, FIT) complement-inhibitory activities 1/log C of 105 benzamidines

                                                             1/log C
No.   X                                                   Observed   Calculated (CV)ᵃ   Calculated (FIT)ᵃ
1     2-CH3                                               -0.444     -0.417             -0.419
2     3,4-(CH3)2                                          -0.425     -0.423             -0.424
3     H                                                   -0.418     -0.424             -0.423
4     3-OH                                                -0.415     -0.439             -0.434
5     3-CF3                                               -0.410     -0.378             -0.382
6     3-NO2                                               -0.410     -0.392             -0.395
7     3-Br                                                -0.405     -0.399             -0.400
8     3-CH3                                               -0.398     -0.399             -0.399
9     3-OCH3                                              -0.397     -0.401             -0.401
10    3-CH2C6H5                                           -0.373     -0.343             -0.346
11    3,5-(CH3)2                                          -0.361     -0.375             -0.369
12    3-OC3H7                                             -0.355     -0.358             -0.358
13    3-i-C3H7                                            -0.355     -0.344             -0.345
14    3-OC4H9                                             -0.351     -0.340             -0.341
15    3-C4H9                                              -0.338     -0.355             -0.353
16    3-CH=CHC6H5                                         -0.339     -0.324             -0.325
17    3-OCH2C6H5                                          -0.331     -0.324             -0.324
18    3-(CH2)2C6H5                                        -0.330     -0.332             -0.331
19    3-OC6H13                                            -0.329     -0.318             -0.319
20    3-O(CH2)4OC6H5                                      -0.325     -0.286             -0.287
21    3-O(CH2)2OC6H5                                      -0.323     -0.314             -0.315
22    3-C6H5                                              -0.323     -0.366             -0.359
23    3-O(CH2)?OC6H4-4-COOH                               -0.321     -0.296             -0.297
24    3-OC5H11                                            -0.320     -0.327             -0.326
25    3-O-i-C5H11                                         -0.318     -0.338             -0.335
26    3-O(CH2)2OC10H7-α                                   -0.312     -0.255             -0.262
27    3-O(CH2)4OC6H4-4-NH2                                -0.306     -0.288             -0.289
28    3-(CH2)4C6H5                                        -0.302     -0.315             -0.313
29    3-O(CH2)3OC6H4-4-NO2                                -0.301     -0.282             -0.282
30    3-O(CH2)?OC6H4-4-NH2                                -0.300     -0.298             -0.298
31    3-(CH2)2-4-C5H4N                                    -0.299     -0.318             -0.318
32    3-O(CH2)2OC6H5                                      -0.299     -0.295             -0.295
33    3-O(CH2)?C6H5                                       -0.296     -0.290             -0.290
34    3-(CH2)2-3-C5H4N                                    -0.294     -0.298             -0.298
35    3-(CH2)4C6H4-4-NHAc                                 -0.294     -0.281             -0.282
36    3-(CH2)2-2-C5H4N                                    -0.291     -0.300             -0.299
37    3-O(CH2)?OC6H4-2-NH2                                -0.283     -0.288             -0.288
38    3-O(CH2)3OC6H4-4-NHAc                               -0.278     -0.270             -0.270
39    3-(CH2)4-3-C5H4N                                    -0.276     -0.284             -0.284
40    3-O(CH2)4C6H5                                       -0.276     -0.277             -0.277
41    3-O(CH2)3OC6H4-3-NHAc                               -0.270     -0.260             -0.260
42    3-O(CH2)3OC6H3-3,4-Cl2                              -0.265     -0.271             -0.271
43    3-O(CH2)3OC6H4-3-NH2                                -0.265     -0.283             -0.283
44    3-O(CH2)3OC6H4-2-NHCOC6H4-4-SO2F                    -0.265     -0.247             -0.247
45    3-O(CH2)3OC6H4-2-NHCOC6H5                           -0.265     -0.258             -0.258
46    3-O(CH2)3OC6H4-4-OCH3                               -0.262     -0.275             -0.274
47    3-O(CH2)4OC6H4-4-NHCONHC6H4-4-SO2F                  -0.260     -0.236             -0.237
48    3-O(CH2)?OC6H4-2-NHCOC6H3-2-OCH3-5-SO2F             -0.260     -0.226             -0.227
49    3-O(CH2)3OC6H4-4-Cl                                 -0.257     -0.287             -0.286
50    3-O(CH2)?OC6H4-2-NO2                                -0.257     -0.279             -0.279
51    3-O(CH2)?OC6H4-3-NO2                                -0.257     -0.268             -0.268
52    3-O(CH2)?OC6H4-3-OCH3                               -0.256     -0.255             -0.255
53    3-O(CH2)?OC6H4-2-NHCOC6H3-2-Cl-6-SO2F               -0.255     -0.247             -0.248
54    3-O(CH2)?OC6H4-2-NHCONHC6H5                         -0.255     -0.260             -0.259
55    3-O(CH2)?OC6H4-2-NHCONHC6H3-2-Cl-5-SO2F             -0.250     -0.246             -0.246
56    3-O(CH2)?OC6H4-2-NHCONHCH2C6H4-4-SO2F               -0.250     -0.232             -0.232
57    3-O(CH2)3OC6H4-2-NHCONH-C6H2-2,4-(CH3)2-5-SO2F      -0.248     -0.242             -0.242
58    3-O(CH2)?OC6H4-4-COOCH3                             -0.247     -0.261             -0.261
59    3-O(CH2)?OC6H3-3-NO2-4-CH3                          -0.245     -0.268             -0.267
60    3-O(CH2)?OC6H4-3-CF3                                -0.245     -0.276             -0.275
61    3-O(CH2)?OC6H4-2-NHCONHC6H4-4-CH3-3-SO2F            -0.245     -0.232             -0.232
62    3-O(CH2)?OC6H4-4-NHCOC6H5                           -0.244     -0.242             -0.242
63    3-O(CH2)?OC6H4-2-NHCOCH2OC6H4-4-SO2F                -0.244     -0.239             -0.239
64    3-O(CH2)3OC6H4-4-NHCOC6H4-4-OCH3                    -0.243     -0.228             -0.229
65    3-O(CH2)3OC6H4-2-NHCOC6H4-3-SO2F                    -0.243     -0.234             -0.234
66    3-O(CH2)?OC6H4-2-NHCOCH2C6H4-4-SO2F                 -0.243     -0.242             -0.242
67    3-O(CH2)?OC6H4-3-COOCH3                             -0.242     -0.256             -0.256
68    3-O(CH2)?OC6H4-2-NHCO(CH2)?C6H4-4-SO2F              -0.242     -0.232             -0.232
69    3-O(CH2)3OC6H4-4-NHCOC6H4-4-NO2                     -0.239     -0.234             -0.234
70    3-O(CH2)?OC6H4-2-NHCOC6H4-4-NO2                     -0.239     -0.248             -0.248
71    3-O(CH2)3OC6H4-4-NHCONHC6H5                         -0.237     -0.252             -0.252
72    3-O(CH2)?OC6H4-4-NHCOC6H4-3-NO2                     -0.237     -0.225             -0.225
73    3-O(CH2)?OC6H4-2-NHCO(CH2)4C6H4-4-SO2F              -0.237     -0.220             -0.221
74    3-O(CH2)3OC6H4-2-NHCONHC6H4-4-SO2F                  -0.237     -0.248             -0.248
75    3-O(CH2)?OC6H4-3-NHCONHC6H4-4-SO2F                  -0.236     -0.231             -0.231
76    3-O(CH2)?OC6H4-2-NHCONH(CH2)?C6H4-4-SO2F            -0.236     -0.224             -0.224
77    3-O(CH2)4OC6H4-3-NHCOC6H4-4-SO2F                    -0.236     -0.222             -0.222
78    3-O(CH2)?OC6H4-2-NHCONHC6H3-4-Cl-3-SO2F             -0.235     -0.236             -0.236
79    3-O(CH2)4OC6H4-2-NHCOC6H3-4-CH3-3-SO2F              -0.235     -0.229             -0.229
80    3-O(CH2)?OC6H4-2-NHCOC6H2-2,4-(CH3)2-5-SO2F         -0.234     -0.238             -0.237
81    3-O(CH2)?OC6H4-2-NHCOC6H2-2,4-Cl2-5-SO2F            -0.234     -0.243             -0.243
82    3-(CH2)4C6H4-2-NHCONHC6H4-3-SO2F                    -0.234     -0.247             -0.246
83    3-O(CH2)3OC6H4-3-NHCOC6H4-4-OCH3                    -0.233     -0.219             -0.219
84    3-(CH2)4C6H4-2-NHCONHC6H4-4-SO2F                    -0.233     -0.263             -0.261
85    3-O(CH2)3OC6H4-4-NHCOC6H4-4-Cl                      -0.232     -0.238             -0.238
86    3-O(CH2)3OC6H4-2-NHCOC6H3-2-CH3-5-SO2F              -0.232     -0.234             -0.234
87    3-O(CH2)4OC6H4-4-NHCONHC6H3-2-OCH3-5-SO2F           -0.232     -0.214             -0.215
88    3-O(CH2)?OC6H4-4-C6H5                               -0.230     -0.256             -0.254
89    3-O(CH2)?OC6H4-2-NHCONHC6H4-3-SO2F                  -0.230     -0.232             -0.232
90    3-O(CH2)3OC6H4-3-NHCOC6H4-3-SO2F                    -0.230     -0.210             -0.211
91    3-O(CH2)2OC6H4-3-NHCOC6H4-3-SO2F                    -0.229     -0.222             -0.222
92    3-O(CH2)3OC6H4-4-CH3O-NHCOC6H4-4-SO2F               -0.229     -0.227             -0.227
93    3-O(CH2)3OC6H4-3-NHCONHC6H4-3-SO2F                  -0.222     -0.216             -0.216
94    3-O(CH2)3OC6H4-3-NHCOCH2C6H4-4-SO2F                 -0.220     -0.222             -0.222
95    3-O(CH2)3OC6H4-3-NHCOC6H4-4-SO2F                    -0.219     -0.224             -0.224
96    3-O(CH2)3OC6H4-2-NHCONHC6H3-2-Cl-5-SO2F             -0.217     -0.235             -0.235
97    3-O(CH2)3OC6H4-3-NHCOCH2OC6H4-4-SO2F                -0.217     -0.218             -0.218
98    3-O(CH2)2OC6H4-3-NHCONHC6H4-4-SO2F                  -0.216     -0.245             -0.244
99    3-O(CH2)4OC6H4-3-NHCONHC6H4-4-SO2F                  -0.215     -0.229             -0.229
100   3-O(CH2)3OC6H4-3-NHCOC6H4-4-NO2                     -0.214     -0.226             -0.226
101   3-O(CH2)?OC6H?-?-NHCOC6H?-?-SO2F                    -0.214     -0.238             -0.237
102   3-O(CH2)4OC6H4-2-NHCONHC6H3-2-Cl-5-SO2F             -0.207     -0.231             -0.231
103   3-O(CH2)3OC6H4-3-NHCONHC6H4-4-NO2                   -0.204     -0.233             -0.232
104   3-O(CH2)3OC6H3-4-CH3-3-NHCONHC6H4-4-SO2F            -0.204     -0.224             -0.223
105   3-O(CH2)3OC6H4-3-NHCONH(CH2)?C6H4-4-SO2F            -0.193     -0.203             -0.203

ᵃ CV and FIT values are calculated using Eq. (8).
Table 2

69 benzene derivatives and their observed and calculated (cross-validated, CV, and fitted, FIT) fathead minnow toxicities, expressed
as -log(LC50)

                                          -log(LC50)
No.   Compound                         Observed   Calculated (CV)ᵃ   Calculated (FIT)ᵃ
1     Benzene                          3.40       3.29               3.32
2     Bromobenzene                     3.89       4.04               4.01
3     Chlorobenzene                    3.77       3.75               3.75
4     Phenol                           3.51       3.31               3.35
5     Toluene                          3.32       3.51               3.49
6     1,2-Dichlorobenzene              4.40       4.33               4.33
7     1,3-Dichlorobenzene              4.30       4.10               4.12
8     1,4-Dichlorobenzene              4.62       4.80               4.77
9     2-Chlorophenol                   4.02       4.01               4.01
10    3-Chlorotoluene                  3.84       3.72               3.73
11    4-Chlorotoluene                  4.33       4.11               4.13
12    1,3-Dihydroxybenzene             3.04       3.31               3.28
13    3-Hydroxyanisole                 3.21       3.13               3.14
14    2-Methylphenol                   3.77       3.62               3.62
15    3-Methylphenol                   3.29       3.52               3.51
16    4-Methylphenol                   3.58       3.64               3.64
17    4-Nitrophenol                    3.36       3.68               3.66
18    1,4-Dimethoxybenzene             3.07       3.01               3.01
19    1,2-Dimethylbenzene              3.48       3.84               3.81
20    1,4-Dimethylbenzene              4.21       3.94               3.97
21    2-Nitrotoluene                   3.57       3.70               3.69
22    3-Nitrotoluene                   3.63       3.67               3.66
23    4-Nitrotoluene                   3.76       3.71               3.71
24    1,2-Dinitrobenzene               5.45       4.95               5.09
25    1,3-Dinitrobenzene               4.38       4.12               4.15
26    1,4-Dinitrobenzene               5.22       4.83               4.91
27    2-Methyl-3-nitroaniline          3.48       3.74               3.73
28    2-Methyl-4-nitroaniline          3.24       3.50               3.47
29    2-Methyl-5-nitroaniline          3.35       3.80               3.77
30    2-Methyl-6-nitroaniline          3.80       3.76               3.76
31    3-Methyl-6-nitroaniline          3.80       3.61               3.62
32    4-Methyl-2-nitroaniline          3.79       3.78               3.78
33    4-Hydroxy-3-nitroaniline         3.65       3.51               3.52
34    4-Methyl-3-nitroaniline          3.77       3.78               3.78
35    1,2,3-Trichlorobenzene           4.89       4.84               4.84
36    1,2,4-Trichlorobenzene           5.00       5.02               5.02
37    1,3,5-Trichlorobenzene           4.74       4.36               4.45
38    2,4-Dichlorophenol               4.30       4.53               4.52
39    3,4-Dichlorotoluene              4.74       4.46               4.48
40    2,4-Dichlorotoluene              4.54       4.57               4.56
41    4-Chloro-3-methylphenol          4.27       4.27               4.27
42    2,4-Dimethylphenol               3.86       3.74               3.76
43    2,6-Dimethylphenol               3.75       3.75               3.75
44    3,4-Dimethylphenol               3.90       3.90               3.90
45    2,4-Dinitrophenol                4.04       4.03               4.04
46    1,2,4-Trimethylbenzene           4.21       4.07               4.09
47    2,3-Dinitrotoluene               5.01       5.29               5.21
48    2,4-Dinitrotoluene               3.75       4.29               4.27
49    2,5-Dinitrotoluene               5.15       4.89               4.93
50    2,6-Dinitrotoluene               3.99       4.43               4.41
51    3,4-Dinitrotoluene               5.08       5.29               5.23
52    3,5-Dinitrotoluene               3.91       4.25               4.23
53    1,3,5-Trinitrobenzene            5.29       5.29               5.29
54    2-Methyl-3,5-dinitroaniline      4.12       4.23               4.22
55    2-Methyl-3,6-dinitroaniline      5.34       4.59               4.64
56    3-Methyl-2,4-dinitroaniline      4.26       3.97               4.00
57    5-Methyl-2,4-dinitroaniline      4.92       3.88               3.97
58    4-Methyl-2,6-dinitroaniline      4.21       4.76               4.72
59    5-Methyl-2,6-dinitroaniline      4.18       4.64               4.61
60    4-Methyl-3,5-dinitroaniline      4.46       4.33               4.34
61    2,4,6-Tribromophenol             4.70       4.98               4.82
62    1,2,3,4-Tetrachlorobenzene       5.43       5.55               5.53
63    1,2,4,5-Tetrachlorobenzene       5.85       5.76               5.77
64    2,4,6-Trichlorophenol            4.33       4.68               4.64
65    2-Methyl-4,6-dinitrophenol       5.00       4.45               4.48
66    2,3,6-Trinitrotoluene            6.37       6.39               6.38
67    2,4,6-Trinitrotoluene            4.88       5.32               5.26
68    2,3,4,5-Tetrachlorophenol        5.72       5.64               5.65
69    2,3,4,5,6-Pentachlorophenol      6.06       6.01               6.03

ᵃ CV and FIT values are calculated using Eq. (10).


which was achieved by the orthogonalization of descriptors,
because in the orthogonal basis the computation
of R is much faster and simpler (Lucic et al.,
1995a,b,c; Lucic, 1997). Namely, when the
MR model is based on a set of I orthogonalized descriptors
$d_i$ ($i = 1, \ldots, I$), the correlation coefficient
between the experimental values of the modeled activity A
and the values estimated by the model, $A_{cal}$, can be
calculated in a very simple way (Eq. (1)):

$$R = \sqrt{\sum_{i=1}^{I} R_i^{2}} \qquad (1)$$

where $R_i$ is the correlation coefficient between each
orthogonalized descriptor $d_i$ and the modeled activity
A. For example, using this procedure it takes 28 CPU
min on a Hewlett-Packard 9000/E55 computer, which is
configured as a server, to select the best MR model
with five out of 104 descriptors among ~10^8 possible
models.
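
To make the two quantitative points above concrete, the following sketch counts the candidate five-descriptor models and checks numerically that, for mutually orthogonal mean-centred descriptors, the multiple correlation coefficient equals the square root of the sum of the squared single-descriptor correlations. It is an illustration under stated assumptions: the data are synthetic, all function names are hypothetical, and plain Gram-Schmidt is used as a generic orthogonalization, which is not necessarily the procedure of Lucic et al.

```python
import math
import random


def center(v):
    """Subtract the mean from every element."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    """Pearson correlation coefficient between two sequences."""
    uc, vc = center(u), center(v)
    return dot(uc, vc) / math.sqrt(dot(uc, uc) * dot(vc, vc))


def orthogonalize(columns):
    """Gram-Schmidt on mean-centred columns; returns mutually orthogonal columns."""
    basis = []
    for col in columns:
        w = center(col)
        for q in basis:
            coeff = dot(w, q) / dot(q, q)
            w = [a - coeff * b for a, b in zip(w, q)]
        basis.append(w)
    return basis


def multiple_r(orthogonal_columns, activity):
    """R of the MR model obtained from the single-descriptor correlations R_i."""
    return math.sqrt(sum(pearson(d, activity) ** 2 for d in orthogonal_columns))


def fitted_values(orthogonal_columns, activity):
    """Least-squares fit; in an orthogonal basis each coefficient is computed independently."""
    mean_a = sum(activity) / len(activity)
    ac = center(activity)
    fit = [mean_a] * len(activity)
    for d in orthogonal_columns:
        coeff = dot(d, ac) / dot(d, d)
        fit = [f + coeff * x for f, x in zip(fit, d)]
    return fit


# Number of distinct five-descriptor models that can be drawn from 104 descriptors
n_models = math.comb(104, 5)

# Synthetic data: three correlated-by-chance descriptors and a noisy activity
random.seed(0)
n = 40
descriptors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
activity = [2 * a - b + 0.5 * c + random.gauss(0, 0.3) for a, b, c in zip(*descriptors)]

basis = orthogonalize(descriptors)
r_fast = multiple_r(basis, activity)                              # sqrt of sum of R_i^2
r_direct = pearson(fitted_values(basis, activity), activity)      # R from the fitted values

print(n_models, r_fast, r_direct)
```

The count comes to 91,962,520, in line with the order of magnitude of 10^8 quoted in the text, and the two values of R agree to within floating-point error. The practical benefit is that each R_i needs to be computed only once per orthogonalized descriptor, so scoring a candidate model reduces to summing a handful of squares.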

3.  Results  and  discussion 

3.1. QSAR of benzamidines

The best one-descriptor structure-complement-inhibitory
activity model of benzamidines obtained is:

$$1/\log C = -0.9332(\pm 0.0229) + 0.4395(\pm 0.0152)\,H^{V}$$
$$n = 105 \quad R = 0.943 \quad R_{cv} = 0.941 \quad S = 0.0195 \quad S_{cv} = 0.0199 \quad F = 832 \qquad (2)$$

where $H^{V}$ is the graph-vertex complexity (Basak, 1987),
n is the number of benzamidine derivatives considered,
R is the correlation coefficient, $R_{cv}$ is the leave-one-out
(cross-validated) correlation coefficient, F is the F-value, S
is the standard error and $S_{cv}$ is the cross-validated
(leave-one-out) standard error of estimate (root-mean-square
error), both with N-2 in the denominator. This
model is only slightly better than the earlier obtained
one-descriptor model, but with a different descriptor
(Basak et al., 1999a):

$$1/\log C = -0.6428(\pm 0.0129) + 0.0490(\pm 0.0017)\,^{3D}W$$
$$n = 105 \quad R = 0.943 \quad R_{cv} = 0.940 \quad S = 0.0196 \quad S_{cv} = 0.0200 \quad F = 824 \qquad (3)$$

where $^{3D}W$ is the 3-D Wiener number for the hydrogen-suppressed
structures computed using their geometric
distance matrices (Bogdanov et al., 1989). Close to this
model is a model with the 3-D Wiener number computed
for structures containing all atoms including hydrogens
(Bosnjak et al., 1991) (n = 105, R = 0.941, $R_{cv}$ = 0.939,
S = 0.0199, $S_{cv}$ = 0.0203).

The best two-descriptor model of the benzamidine
structure-complement-inhibitory activity is:

$$1/\log C = -0.6878(\pm 0.0175) + 0.1327(\pm 0.0367)\,W + 0.1864(\pm 0.0?S0)\,^{3D}W$$
$$n = 105 \quad R = 0.950 \quad R_{cv} = 0.947 \quad S = 0.0184 \quad S_{cv} = 0.0189 \quad F = 467 \qquad (4)$$

where W is the 2-D Wiener number (Wiener, 1947).
The best three-descriptor model is given by:
Table 3
Descriptions of all considered descriptors and symbols of only those descriptors involved in the models

Symbol                    Description
                          Information index for the magnitude of distances between all possible pairs of vertices of a graph
                          Mean information index for the magnitude of distance
$W$                       Wiener index, the half-sum of the off-diagonal elements of the molecular distance matrix
                          Degree complexity
$H^{V}$                   Graph vertex complexity
                          Graph distance complexity
                          Information content of the distance matrix partitioned by frequency of occurrences of distance h
                          Information content of the hydrogen-suppressed graph at its maximum neighborhood of vertices
                          Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
                          A Zagreb group parameter, the sum of square of degree over all vertices
                          A Zagreb group parameter, the sum of cross-product of degrees over all neighboring (connected) vertices
$IC_r$                    Mean information content of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$                   Structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$                   Complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
                          Path connectivity index of order h = 0-6
                          Cluster connectivity index of order h = 3-6
                          Chain connectivity index of order h = 6
                          Path-cluster connectivity index of order h = 4-6
                          Bond path connectivity index of order h = 0-6
${}^{h}\chi^{b}_{C}$      Bond cluster connectivity index of order h = 3-6
${}^{h}\chi^{b}_{Ch}$     Bond chain connectivity index of order h = 6
                          Bond path-cluster connectivity index of order h = 4-6
${}^{h}\chi^{v}$          Valence path connectivity index of order h = 0-6
${}^{h}\chi^{v}_{C}$      Valence cluster connectivity index of order h = 3-6
${}^{h}\chi^{v}_{Ch}$     Valence chain connectivity index of order h = 6
                          Valence path-cluster connectivity index of order h = 4-6
$P_h$                     Number of paths of length h = 0-10
                          Balaban's J index based on distance
                          Balaban's J index based on relative electronegativities
                          Balaban's J index based on relative covalent radii
                          Balaban's J index based on bond types
                          Energy of the highest occupied molecular orbital
                          Energy of the second highest occupied molecular orbital
$E_{LUMO}$                Energy of the lowest unoccupied molecular orbital
                          Energy of the second lowest unoccupied molecular orbital
$\Delta H_f$              Heat of formation
$\mu$                     Dipole moment
                          Van der Waals volume
${}^{3D}W_H$              3-D Wiener index for the hydrogen-filled geometric distance matrix
${}^{3D}W$                3-D Wiener index for the hydrogen-suppressed geometric distance matrix

$$1/\log C = -0.6400(\pm 0.0239) + 0.1273(\pm 0.0355)\,W + 0.0103(\pm 0.0037)\,P_9 + 0.1698(\pm 0.0372)\,{}^{3D}W \qquad (5)$$

n = 105, R = 0.954, Rcv = 0.949, S = 0.0177, Scv = 0.0185, F = 335

where $P_9$ is the number of paths of length nine. $P_9$ could be omitted from Eq. (5) because the error of its regression coefficient is large relative to the value of the coefficient itself. Eq. (5) then simply reduces to Eq. (4). The best four-descriptor model is:

$$1/\log C = -0.6999(\pm 0.0194) + 0.1321(\pm 0.0354)\,W + 5.0332(\pm 1.2285)\,{}^{6}\chi^{b}_{Ch} - 5.1120(\pm 1.2486)\,{}^{6}\chi^{v}_{Ch} + 0.1885(\pm 0.0359)\,{}^{3D}W \qquad (6)$$

n = 105, R = 0.957, Rcv = 0.953, S = 0.0170, Scv = 0.0177, F = 272

where ${}^{6}\chi^{b}_{Ch}$ and ${}^{6}\chi^{v}_{Ch}$ denote the bond-chain and valence-chain connectivity indices of order six, respectively.

Hansch and Yoshimoto (1974) published, 25 years ago, the following four-descriptor model for benzamidine derivatives inhibiting complement (the model is given in their notation):

$$\log(1/C) = 0.15(\pm 0.03)(\mathrm{MR\text{-}1,2}) + 1.07(\pm 0.13)(\mathrm{D\text{-}1}) + 0.52(\pm 0.28)(\mathrm{D\text{-}2}) + 0.43(\pm 0.14)(\mathrm{D\text{-}3}) + 2.425(\pm 0.12) \qquad (7)$$

n = 108, R = 0.935, S = 0.258

where MR is the molar refractivity of substituents at positions 1 and 2, taken from the compilation by Hansch et al. (1973) or computed, while D-1, D-2, and D-3 are indicator variables for the presence or absence of three kinds of substructural units in a given benzamidine. To compare the fitted statistical parameters of our four-descriptor model (Eq. (6)) with those of the model given by Eq. (7), we retransformed our results into the log(1/C) scale used by Hansch and Yoshimoto. We thus obtained statistical parameters (R = 0.941 and S = 0.237) that are comparable with their result. However, Hansch and Yoshimoto considered 108 benzamidine derivatives, whereas we considered only 105. This discrepancy is caused by problematic data for three compounds, which we discarded from the set of benzamidine derivatives (Basak et al., 1999a). The nature of the descriptors used in these two types of models is also different: our descriptors are calculated solely from the structures of the studied molecules, while the Hansch-Yoshimoto parameters (molar refractivities of substituents) are experimentally based.

Finally, the five-descriptor model is:

$$1/\log C = 1.5264(\pm 0.3534) + 0.6323(\pm 0.0936)\,(IC)_2 - 1.6788(\pm 0.2720)\,(IC)_6 - 1.4540(\pm 0.2043)\,(SIC)_1 - 0.4239(\pm 0.0680)\,(CIC)_6 + 0.1286(\pm 0.0149)\,{}^{3D}W \qquad (8)$$

n = 105, R = 0.963, Rcv = 0.957, S = 0.0158, Scv = 0.0170, F = 253

where $(IC)_2$ and $(IC)_6$ denote the mean information content of the structure based on the second- and sixth-order neighborhood of atoms, including hydrogens, respectively; $(SIC)_1$ and $(CIC)_6$ are, respectively, the structural information content for the first-order neighborhood and the complementary information content for the sixth-order neighborhood of atoms, including hydrogens, in the structure. $(IC)_r$, $(SIC)_r$ and $(CIC)_r$ are molecular complexity indices introduced some time ago by one of us (Basak, 1987) for use in predictive pharmacology and toxicology.

It is interesting to note that the 3-D Wiener number is present in all models given here except the very best single-descriptor model, although it is present in the next best single-descriptor model. This is not surprising because this descriptor has been shown to be very useful in structure-property-activity modeling (Bogdanov et al., 1989; Bosnjak et al., 1991; Mihalic and Trinajstic, 1991; Nikolic et al., 1991; Trinajstic, 1992).
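The two Wiener numbers that recur in these models can be sketched as follows. The 2-D index is the half-sum of the off-diagonal elements of the topological distance matrix, and the 3-D index is the analogous half-sum over the geometric (Euclidean) distance matrix. The function names, the bond-list input format, and the example coordinates are illustrative assumptions, not the program used by the authors.

```python
import math
from collections import deque

def wiener_2d(n_atoms, bonds):
    """Half-sum of topological distances in a hydrogen-suppressed graph."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    total = 0
    for start in range(n_atoms):
        # Breadth-first search gives shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted twice

def wiener_3d(coords):
    """Half-sum of the geometric distance matrix (sum over atom pairs)."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += math.dist(coords[i], coords[j])
    return total

# Carbon skeletons of n-butane (a path) and isobutane (a star).
print(wiener_2d(4, [(0, 1), (1, 2), (2, 3)]))   # 10
print(wiener_2d(4, [(0, 1), (0, 2), (0, 3)]))   # 9
# Three hypothetical collinear atoms one unit apart.
print(wiener_3d([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))  # 4.0
```

Passing coordinates with or without hydrogen atoms to `wiener_3d` corresponds to the hydrogen-filled and hydrogen-suppressed variants distinguished in the text.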

Models containing more descriptors did not outperform the above five-descriptor model. Thus, the five-descriptor model (Eq. (8)), selected from the initial set of descriptors, is the best QSAR model, according to the calculated cross-validated statistical parameters, for predicting the benzamidine structure-complement-inhibitory activity. This model is better than the one-descriptor model previously obtained using the hierarchical approach (Basak et al., 1999a). However, according to the F-values, the one-descriptor models selected in this paper and in our previous work (Basak et al., 1999a) appear to be better than the five-descriptor model. But the F-value is calculated only from the fitted correlation coefficient R, taking into account the number of parameters optimized in the model. Because it is accepted (Ortiz et al., 1997) that cross-validated statistical parameters give better evidence of model quality than fitted statistical parameters, our final conclusions are based on cross-validated statistical parameters, although prediction for compounds from an external data set would be the best test of model quality. A plot of the experimental versus predicted values of 1/log C, calculated in the cross-validation procedure using Eq. (8), is given in Fig. 2. Computed (fitted and leave-one-out cross-validated) 1/log C values are given in Table 1.

Fig. 2. A plot of observed versus calculated (cross-validated) 1/log C complement-inhibitory activity of benzamidines.

Fig. 3. A plot of observed versus calculated (cross-validated) -log(LC50) acute toxicities of benzene derivatives.
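The remark that the F-value depends only on the fitted R and the number of optimized parameters can be checked with a short sketch. It assumes the standard overall regression F statistic for n compounds and k descriptors; the function name is ours. Because the published R values are rounded to three decimals, the recomputed F agrees with the reported one only to within rounding.

```python
def f_value(r, n, k):
    """Overall regression F statistic from fitted R, n compounds, k descriptors."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# (label, R, n, k) taken from the equations in the text.
models = [
    ("Eq. (6)", 0.957, 105, 4),
    ("Eq. (8)", 0.963, 105, 5),
    ("Eq. (9)", 0.927, 69, 5),
]
for label, r, n, k in models:
    print(label, round(f_value(r, n, k)))
```

Since F is divided by k, a model with more descriptors needs a substantially higher R to keep the same F, which is why the one-descriptor models show the largest F-values even though their cross-validated errors are worse.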

3.2.  QSAR  of  benzene  derivatives 

The best linear five-descriptor structure-toxicity model of benzene derivatives selected by the CROMRsel program is:

$$-\log(\mathrm{LC}_{50}) = 5.2032(\pm 0.546) + 0.8488(\pm 0.106)\,P_9 + 1.7979(\pm 0.183)\,{}^{4}\chi^{v}_{PC} - 0.4439(\pm 0.0523)\,E_{LUMO} - 0.1379(\pm 0.0195)\,\mu - 0.2961(\pm 0.0927)\,{}^{3D}W_H \qquad (9)$$

n = 69, R = 0.927, Rcv = 0.914, S = 0.287, Scv = 0.312, F = 77

where $P_9$ is the number of paths of length nine, ${}^{4}\chi^{v}_{PC}$ is the valence path-cluster connectivity index of order four, $E_{LUMO}$ is the energy of the lowest unoccupied molecular orbital, $\mu$ is the dipole moment, and ${}^{3D}W_H$ is the 3-D Wiener number for the hydrogen-filled structures computed using their geometric distance matrices (Bogdanov et al., 1989). This model has two descriptors fewer than the best model obtained by the hierarchical approach (see Gute and Basak, 1997) and possesses almost the same statistical parameters.

The best linear seven-descriptor model is:

$$-\log(\mathrm{LC}_{50}) = 4.4100(\pm 0.809) + 0.8637(\pm 0.0988)\,P_9 + 2.5278(\pm 0.833)\,{}^{2}\chi^{v} - 3.1248(\pm 0.655)\,{}^{4}\chi^{v} + 1.5628(\pm 0.372)\,{}^{6}\chi^{v}_{PC} - 0.44157(\pm 0.051)\,E_{LUMO} - 0.1364(\pm 0.018)\,\mu - 0.34054(\pm 0.087)\,{}^{3D}W_H \qquad (10)$$

n = 69, R = 0.940, Rcv = 0.925, S = 0.262, Scv = 0.291, F = 66

where ${}^{2}\chi^{v}$ and ${}^{4}\chi^{v}$ denote valence path connectivity indices of order two and four, respectively, and ${}^{6}\chi^{v}_{PC}$ is the valence path-cluster connectivity index of order six. The other descriptors are the same as those in the five-descriptor model (Eq. (9)). This model (R2 = 0.884, F = 66, S = 0.26) is better than the seven-descriptor model obtained by the hierarchical procedure (see Gute and Basak, 1997) (R2 = 0.863, F = 50, S = 0.30), and one can see that these two models contain three identical descriptors: $P_9$, ${}^{3D}W_H$ and $\mu$. Fitted and cross-validated predicted values for all benzene derivatives obtained using Eq. (10) are given in Table 2. A plot of the experimental versus predicted values of -log(LC50), calculated in the cross-validation procedure using Eq. (10), is given in Fig. 3.

We also found several seven-descriptor linear multiregression models with better statistical parameters than the best seven-descriptor model of Gute and Basak (see Gute and Basak, 1997). One of them is very similar to the model given as Eq. (10); it involves the descriptors $H^{V}$, $P_9$, 3/£, 5xZ, $\Delta H_f$, $\mu$, and ${}^{3D}W_H$ (see Table 3 for descriptions of the descriptors) and possesses the following statistical parameters: R = 0.9398, Rcv = 0.9245, S = 0.262, Scv = 0.292, F = 66.

In addition, we performed modeling in order to compare our seven-descriptor model with the additivity model (using eight terms, i.e. eight optimized parameters) derived by Hall et al. (1984). To do this, we omitted from the data set compounds 53, 57 and 65, which were identified by Hall et al. as outliers. For the remaining 66 compounds, the statistical parameters of the seven-descriptor model (Eq. (10)) are R = 0.955, Rcv = 0.943, S = 0.225, Scv = 0.255, F = 87. These parameters are better than those for the additivity model obtained by Hall et al. (R = 0.951, S = 0.249, F = 67).


4.  Concluding  remark 

The results presented show that the optimum way to carry out QSAR modeling is by selecting the best descriptors in linear (as was the case here) or nonlinear (Lucic and Trinajstic, 1999) multiregression models.


Acknowledgements 

This is contribution number 256 from the Center for Water and the Environment of the Natural Resources Research Institute. This work was supported, in part, by Grants F49620-94-1-0401 and F49620-96-1-0330 from the United States Air Force and 00980606 from the Ministry of Science and Technology of the Republic of Croatia. We thank reviewers for helpful comments.


References 

Basak, S.C., 1987. Use of molecular complexity indices in predictive pharmacology and toxicology. Med. Sci. Res. 15, 605-609.

Basak, S.C., Gute, B.D., Ghatak, S., 1999a. Prediction of complement-inhibitory activity of benzamidines using topological and geometric parameters. J. Chem. Inf. Comput. Sci. 39, 255-260.

Basak, S.C., Gute, B.D., Grunwald, G.D., 1997. Use of topostructural, topochemical and geometric parameters in the prediction of vapor pressure: a hierarchical QSAR approach. J. Chem. Inf. Comput. Sci. 37, 651-655.

Basak, S.C., Gute, B.D., Grunwald, G.D., 1999b. In: Devillers, J., Balaban, A.T. (Eds.), Topological Indices and Related Descriptors in QSAR and QSPR. Gordon and Breach, Reading, pp. 245-261.

Basak, S.C., Gute, B.D., Opitz, D.W., Balasubramanian, K., 1999c. Use of Statistical and Neural Net Methods in Predicting Toxicity of Chemicals: A Hierarchical QSAR Approach. Reported at the American Association of Artificial Intelligence (AAAI) Conference, Predictive Toxicology of Chemicals: Experiences and Impact of AI Tools, Stanford University, March 22-24.

Bogdanov, B., Nikolic, S., Trinajstic, N., 1989. On the three-dimensional Wiener number. J. Math. Chem. 3, 299-309.

Bosnjak, N., Mihalic, Z., Trinajstic, N., 1991. Application of topographic indices to chromatographic data: calculation of the retention indices of alkanes. J. Chromatogr. 540, 430-440.

Free, S.M., Wilson, J.W., 1964. A mathematical contribution to structure-activity studies. J. Med. Chem. 7, 395-399.

Gute, B.D., Basak, S.C., 1997. Predicting acute toxicity (LC50) of benzene derivatives using theoretical molecular descriptors: a hierarchical QSAR approach. SAR QSAR Environ. Res. 7, 117-131.

Hall, L.H., Kier, L.B., Phipps, G., 1984. Structure-activity relationship studies on the toxicities of benzene derivatives: I. an additivity model. Environ. Toxicol. Chem. 3, 355-365.

Hansch, C., Leo, A., Unger, S.H., Kim, K.H., Nikaitani, D., Lien, E.J., 1973. Aromatic substituent constants for structure-activity correlations. J. Med. Chem. 16, 1207-1216.

Hansch, C., Yoshimoto, M., 1974. Structure-activity relationships in immunochemistry. 2. Inhibition of complement by benzamidines. J. Med. Chem. 17, 1160-1167.

Katritzky, A.R., Chen, K., Wang, Y., Karelson, M., Lucic, B., Trinajstic, N., Suzuki, T., Schuurmann, G., 1999. Prediction of liquid viscosity for organic compounds by a quantitative structure-property relationship. J. Phys. Org. Chem. (in press).

Kubinyi, H., Kehrhahn, O.H., 1976. Quantitative structure-activity relationships: a comparison of different Free-Wilson models. J. Med. Chem. 19, 1040-1045.

Kuby, J., 1992. Immunology. Freeman, New York.

Lucic, B., 1997. Ph.D. Dissertation, University of Zagreb, Zagreb. CROMRsel.f (CROatian MultiRegression selection of descriptors) is a computer program for the selection of descriptors for the best MR models.


S.C. Basak et al. / Computers & Chemistry 24 (2000) 181-191


Lucic, B., Amic, D., Trinajstic, N., 1999a. Nonlinear multivariate regression outperforms several concisely designed neural networks in QSAR modeling. J. Chem. Inf. Comput. Sci. (in press).

Lucic, B., Nikolic, S., Trinajstic, N., Juretic, D., 1995a. The structure-property models can be improved using the orthogonalized descriptors. J. Chem. Inf. Comput. Sci. 35.

Lucic, B., Nikolic, S., Trinajstic, N., Juretic, D., Juric, A., 1995b. A novel QSPR approach to physicochemical properties of the α-amino acids. Croat. Chem. Acta 68, 435-450.

Lucic, B., Nikolic, S., Trinajstic, N., Juric, A., Mihalic, Z., 1995c. A structure-property study of the solubility of aliphatic alcohols in water. Croat. Chem. Acta 68, 417-434.

Lucic, B., Trinajstic, N., 1999. Multivariate regression outperforms several robust architectures of neural networks in QSAR modeling. J. Chem. Inf. Comput. Sci. 39, 121-132.

Lucic, B., Trinajstic, N., Sild, S., Karelson, M., Katritzky, A.R., 1999b. A new efficient approach for variable selection based on multiregression: prediction of gas chromatographic retention times and response factors. J. Chem. Inf. Comput. Sci. 39, 610-621.

Mihalic, Z., Trinajstic, N., 1991. The algebraic modelling of chemical structures: on the development of three-dimensional molecular descriptors. J. Mol. Struct. (Theochem.) 232, 65-78.

Nikolic, S., Trinajstic, N., Mihalic, Z., Carter, S., 1991. On the geometric distance matrix and the corresponding structural invariants of molecular systems. Chem. Phys. Lett. 179, 21-28.

Ortiz, A.R., Pastor, M., Palomer, A., Cruciani, G., Gago, F., Wade, R.C., 1997. Reliability of comparative molecular field analysis models: effect of data scaling and variable selection using a set of human synovial fluid phospholipase A2 inhibitors. J. Med. Chem. 40, 1136-1148.

Trinajstic, N., 1992. Chemical Graph Theory, second revised ed. CRC, Boca Raton, FL, pp. 262-269.

Wiener, H., 1947. Structural determination of paraffin boiling points. J. Am. Chem. Soc. 69, 17-20.



Appendix 1.7 Construction of high-quality structure-property-activity regressions: The boiling points of sulfides


J. Chem. Inf. Comput. Sci. 2000, 40, 899-905


Construction of High-Quality Structure-Property-Activity Regressions: The Boiling Points of Sulfides

Milan Randic and Subhash C. Basak

Department of Mathematics and Computer Science, Drake University, Des Moines, Iowa 50311, and Natural Resources Research Institute, The University of Minnesota, 5013 Miller Trunk Highway, Duluth, Minnesota 55811

Received  September  9,  1999 


Instead of using the standard molecular descriptors (topological indices) for regression analysis, which are numerically fully determined once a molecule is selected, we outline the use of variable molecular descriptors that are modified during the search for the best regression. The approach is illustrated using boiling points of sulfides. We have transformed the connectivity index ${}^{1}\chi$ into a function of two variables (x, y) which differentiate carbon and sulfur atoms. The optimal values of the variables (x, y) were determined by minimizing the standard error of the regression. With the values x = +0.25 and y = -0.95 for carbon and sulfur, respectively, we have obtained a regression based on a single descriptor and a standard error of 1.8 °C. With elimination of two outliers (having a deviation of about 4 °C) the standard error is reduced to a remarkable 1.3 °C.


INTRODUCTION 

The past decade has witnessed two important developments of multivariate regression analysis, MRA, relevant for quantitative structure-property-activity relationship, QSAR: (1) expansion of mathematical structural descriptors for characterization of molecular structure;1-5 (2) construction of orthogonal molecular descriptors6-12 which result in stable regression equations. The first, which is of interest when better regressions are sought, is rather conspicuous, while the second, which is important for interpretation of the results of such studies, is not yet sufficiently widely appreciated.

In this paper we will address the problem of construction of high-quality regressions (HQR). With hundreds of descriptors available13-15 the questions to consider are as follows: (1) How should an optimal set of descriptors be chosen from a large number of available descriptors? (2) How should one choose between regressions of seemingly similar quality? (3) How unique are regression results? (4) Are there important structural elements missed by the descriptors used? (5) How complete is the space spanned by molecular descriptors for the structure-property-activity studies? (6) Do we need additional molecular descriptors?

HIGH-QUALITY  REGRESSIONS 

The standard error in most correlations still does not approach the experimental error of measurements. How realistic is it to hope to arrive at this goal? As we will show, HQR, in which the standard error has been dramatically reduced in comparison with traditional approaches using the same number of descriptors, can be derived with a new kind of molecular descriptors which involve variability that allows one to optimize the descriptors and minimize the standard error of regression.

Table 1. Standard Error for the Boiling Points of Smaller Sulfides (n = 21 Compounds) for a Selection of Descriptors

descriptors                  standard error    descriptors       standard error
${}^{1}\chi$, J              2.001             ${}^{1}\chi$      2.701
${}^{1}\chi$, n              2.550             n, J              2.748
${}^{1}\chi$, P3             2.560             n, P2/W2          2.981
${}^{1}\chi$, W              2.667             J, W              4.808
${}^{1}\chi$, P2/W2          2.692             W, P              5.109

In Table 1 we illustrate the standard errors for correlations of the boiling points of smaller sulfides (shown in Figure 1) using a selection of molecular descriptors. When the connectivity index16 is used alone, we find the standard error of the regression is 2.70 °C, as shown in the middle of Table 1. When the connectivity index is combined with Balaban's J index,17 the standard error is further reduced to 2.00 °C. Other descriptors, viz., n, the number of non-hydrogen atoms, P3, the number of paths of length 3, W, the Wiener index,18 and the path/walk quotients,19 give only a minor improvement in the standard error over that based on ${}^{1}\chi$ used alone. In contrast, other combinations of molecular descriptors (listed in the right part of Table 1) do not give satisfactory results. The standard error in such combinations is worse than the standard error when the connectivity index is used as a single descriptor, which well illustrates the importance of the proper selection of molecular descriptors.

The compounds considered here were among 45 saturated acyclic compounds possessing divalent sulfur atoms for which Balaban et al.20 found reliable literature data. We took all compounds having six or fewer carbon atoms, a total of 21, and recalculated the regressions for only these smaller sulfides. The study of Balaban and co-workers considered a broader class of compounds: 185 saturated acyclic compounds possessing divalent oxygen or sulfur atoms, devoid of hydrogen bonding, and having 11 or fewer non-hydrogen atoms. Their purpose was as follows: (i) to explore the role of heteroatoms within acyclic skeletons in determining a measured molecular property (boiling points); (ii) to show that topological descriptors can satisfactorily account for the observed relative magnitudes of the property; and (iii) to derive structure-property regressions that may be useful for predicting boiling points of unknown compounds.

Figure 1. Molecular graphs of smaller sulfides and their boiling points. The sulfur atoms are shown as filled circles.


Our  objectives  are  the  same,  but  our  philosophy  in  this 
particular  study  is  somewhat  different:  Rather  than  consider¬ 
ing  a  large  set  of  mixed  compounds  (alkanes,  ethers,  diethers, 
acetals,  and  peroxides  as  well  as  their  sulfur  analogues: 
sulfides  (thioethers),  bis-sulfides,  thioacetals,  and  disulfides), 
which  allows  one  to  use  several  molecular  descriptors  and 
still  maintain  high  statistical  significance  for  the  correlation, 
we  decided  to  use  only  structurally  closely  related  com¬ 
pounds.  In  particular,  we  excluded  bis-sulfides  and  disulfides 
because  of  the  presence  of  S-S  linkage  that  is  absent  in 
sulfides.  This  has  reduced  the  pool  of  the  compounds 
considerably,  which  limits  the  number  of  descriptors  that  one 
should  use  in  analyzing  the  data.  By  homogenizing  the 
sample  of  the  compounds  to  be  examined,  as  we  will  see, 
we  can  achieve  a  very  high  quality  regression  result  using  a 
single  descriptor. 

As we see from Table 1, it is apparently difficult to reduce the standard error for the boiling points of sulfides below 2.5 °C. Among the combinations listed in Table 1, only the one involving Balaban's J reduced the standard error below 2.5 °C. This may not be surprising because none of the descriptors of Table 1 except J differentiates sulfur and carbon atoms. Hence, 2.5 °C may well be the limit that such models can attain.
The  experimental  boiling  points  for  butylmethyl  sulfide  (7) 
and  ethylpropyl  sulfide  (9),  123.2  and  118.5  °C,  respectively, 
differ  by  almost  5  °C.  If  we  overlook  the  difference  between 
sulfur  and  carbon,  both  these  structures  have  the  same 
molecular  graph.  The  same  is  true  for  ethylisopropyl  sulfide 
(6)  and  isobutylmethyl  sulfide  (8),  with  the  boiling  points 
107.4  and  112.5  °C,  respectively.  Hence,  the  simple  con¬ 
nectivity  index  and  other  topological  indices  that  do  not 
discriminate  heteroatoms  can  at  best  approach  the  standard 
error  of  about  2.5  °C. 

Observe that the descriptors listed in Table 1 are of quite distinct structural origin and thus do not duplicate one another. However, many such indices, even when combined (the right part of Table 1), apparently lack the flexibility to represent the data with desirable accuracy. Using descriptors that differentiate heteroatoms, we reach a standard error of about 2 °C. The questions to consider are as follows: Can the standard error of 2 °C obtained using ${}^{1}\chi$ and J be further dramatically reduced? Have we reached the limit for correlating the boiling points of sulfides? Is it that the residual of the molecular property considered cannot be described by any of the available structural descriptors?

FLEXIBLE  MOLECULAR  DESCRIPTORS 

In order to develop a high-quality regression, we need not only new descriptors but a new kind of molecular descriptor that has the flexibility to adjust to the variability that different molecules may show. One such descriptor was introduced in multiple regression analysis 10 years ago,21,22 but apparently has been mostly overlooked. That novelty can be ignored or overlooked has already been well illustrated by the Wiener index W, which waited two decades to be resurrected. In order not to repeat that history, we undertook a concerted effort to illustrate properties of variable descriptors, and the variable connectivity index in particular.23-26 The variable connectivity index represents an important and distinct generalization of the connectivity index ${}^{1}\chi$ since it offers a flexibility that traditional topological indices, all several hundred of them, have been lacking.



We propose here a special symbol, ${}^{1}\chi^{f}$, for the flexible connectivity index, which is to be outlined shortly. The original connectivity index ${}^{1}\chi$ (named so by Kier et al.27), proposed by Randic,16 used fixed numbers as entries in the weighting algorithm $1/(pq)^{1/2}$ for the contribution of a bond whose end atoms have p and q neighbors. The higher order connectivity indices, ${}^{m}\chi$,28 were defined analogously using paths of length m, for m = 2, 3, .... The bonding connectivity indices, ${}^{1}\chi^{b}$, were considered by Basak and Magnuson29 on the basis of weights equal to the number of bonds of an atom: 1 for a single bond, 2 for a double bond, and 3 for a triple bond. The valence connectivity indices, ${}^{1}\chi^{v}$, developed by Kier and Hall,30 use the difference in valence electrons and the number of hydrogen atoms to modify the valence parameter for heteroatoms. Finally, "edge connectivity" indices were recently tested using bond adjacency rather than vertex adjacency in construction of the modified connectivity indices.31
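The bond-weighting rule $1/(pq)^{1/2}$ described above can be sketched in a few lines. The function name and the bond-list input format are our own illustrative choices; the graph is hydrogen-suppressed and every non-hydrogen atom is treated alike, as in the original index.

```python
import math

def connectivity_index(n_atoms, bonds):
    """First-order connectivity index: sum over bonds of 1/sqrt(p*q),
    where p and q are the numbers of neighbors of the two bonded atoms."""
    degree = [0] * n_atoms
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return sum(1.0 / math.sqrt(degree[i] * degree[j]) for i, j in bonds)

# Carbon skeletons of n-butane (path) and isobutane (star).
n_butane = connectivity_index(4, [(0, 1), (1, 2), (2, 3)])
isobutane = connectivity_index(4, [(0, 1), (0, 2), (0, 3)])
print(round(n_butane, 4), round(isobutane, 4))  # 1.9142 1.7321
```

Branching lowers the index, which is the structural trend the index was designed to capture; heteroatoms, however, leave it unchanged, and that limitation motivates the flexible weights discussed next.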

All the above indices, except ${}^{1}\chi^{f}$, are based on fixed weights determined by the connectivity of the molecular graph model used. In our view, a better strategy is to introduce weights that make descriptors "flexible", so that atoms of different types can not only adjust their weights to yield an optimal characterization of a molecule for a particular property but can also change values when different properties of the same set of molecules are considered. In general, for a molecule with n different types of atoms, one can have n different weights $x_i$ (i = 1, 2, ..., n); hence, the flexible connectivity index ${}^{1}\chi^{f}$ becomes a function of n variables. In the case of sulfides, we consider two variables, the weights of carbon and sulfur atoms. In the case of natural amino acids there are four kinds of atoms: carbon, oxygen, nitrogen, and sulfur; hence, in this case flexible connectivity indices ${}^{1}\chi^{f}$ imply optimization of four variables.24 Even if there are no heteroatoms, variable weights can improve regressions visibly.25

It should be noted that while the special types of connectivity indices, viz., the $^m\chi$, $^m\chi^b$, and $^m\chi^v$ indices, explore only local regions of the parameter space, the $^m\chi^f$ indices are capable of exploring the full potential of the parameter space generated by the presence of heteroatoms in a molecule. The previously mentioned simple connectivity indices and valence connectivity indices can be viewed as special cases of the more general flexible indices $^m\chi^f$. Consequently, the flexible indices $^m\chi^f$ are expected to be more powerful in predicting molecular properties and biological activities.

Besides the weighted connectivity indices,21-26 many other topological indices, e.g., the weighted paths $p^f$,32-34 the weighted walks $w^f$, the weighted Hosoya index $Z^f$, the weighted Wiener index $W^f$, and the weighted Balaban index $J^f$, can be generalized in a similar way.35 Except for a half-dozen papers by the present authors,21-26,32-34 the use of variable molecular descriptors is in its infancy.

Dramatic improvement in the quality of regressions was obtained by using variable connectivity indices. For example, when a variable parameter x was introduced for chlorine in clonidine and clonidine-like imidazolidines (2-(arylimino)imidazolidines),21 the value x = -0.20 for chlorine produced a regression which, with three weighted connectivity indices, gave better results for the set of clonidine compounds than the five descriptors used in a traditional QSAR.36
Figure 2. Molecular graph of ethyl isopropyl sulfide and the corresponding numbering of atoms used in Table 2.


Table 2. Adjacency Matrix and Modified Adjacency Matrix for Ethyl Isopropyl Sulfide

adjacency matrix

         1    2    3    4    5    6     row sum
    1    0    1    0    0    0    0     1
    2    1    0    1    0    0    0     2
    3    0    1    0    1    0    0     2
    4    0    0    1    0    1    1     3
    5    0    0    0    1    0    0     1
    6    0    0    0    1    0    0     1

modified adjacency matrix

         1    2    3    4    5    6     row sum
    1    x    1    0    0    0    0     1 + x
    2    1    x    1    0    0    0     2 + x
    3    0    1    y    1    0    0     2 + y
    4    0    0    1    x    1    1     3 + x
    5    0    0    0    1    x    0     1 + x
    6    0    0    0    1    0    x     1 + x

This result is particularly striking for this data set, because there are two extreme potency values which would be expected to give much trouble in cross-validation. Use of two variables that differentiate carbon and oxygen in alcohols, with x = +1.5 and y = -0.85, respectively, reduced the standard error from 7 °C, obtained using the simple connectivity index that does not differentiate carbon and oxygen atoms, to 3.5 °C.22 In the case of amines, the standard error of 3.48 °C for the boiling point model when $^1\chi$ is used was reduced to 1.91 °C with x = +1.25 and y = -0.65.23 The standard error for a quadratic regression using the connectivity index for the boiling points of smaller alkanes is 2.98 °C. When x = +0.65 is introduced as a weight, not only is s = 2.48 °C obtained, a reduction by half a degree Celsius, but the higher precision allowed the recognition of an outlier (with an error of over 6 °C), which, when eliminated, further reduced the standard error to an impressive 1.57 °C.25

OPTIMAL  DESCRIPTORS  FOR  SULFUR 

We will examine the correlation of the boiling points for the sulfides of Figure 1 using functional molecular descriptors and will illustrate the use of a variable connectivity index by considering ethyl isopropyl sulfide (shown in Figure 2 with the numbering of the atoms used). The adjacency matrix and the modified adjacency matrix of ethyl isopropyl sulfide are given in Table 2. If we assume x = 0 and y = 0, we obtain the usual adjacency matrix of a graph, from the row sums of which the simple connectivity index can be directly computed. To obtain the bond contribution for $^1\chi$ we use the algorithm $1/(pq)^{1/2}$. Here p and q are the respective valences, as obtained from the row sums, of the two atoms forming the bond (p, q). When x ≠ 0 and y ≠ 0, the corresponding row sums are modified, and instead of the
Table 3. Modified Connectivity Index $^1\chi^f$ for Ethyl Isopropyl Sulfide with Different Choices of x and y

      x        y       $^1\chi^f(x, y)$
    -0.50    -1.00     4.39251
     0       -1.20     3.29787
     0       -1.00     3.14626
     0       -0.95     3.11531
     0       -0.90     3.08649
     0       -0.75     3.01066
     0       -0.50     2.91056
     0       -0.25     2.83277
    +0.25    -1.00     2.80993
    +0.25    -0.95     2.78049
     0        0        2.77006
    +0.25    -0.90     2.75309
     0       +0.50     2.67417
    +0.50    -1.00     2.55625
    +0.50    -0.95     2.52812
    +1.00    -1.00     2.19271
    +2.00    -1.00     1.75229

fixed valences p, q, we have the variable valences (p + x), (q + x), or (q + y), depending on the kind of atoms involved. Thus instead of the simple (“fixed”) connectivity index $^1\chi = 1/\sqrt{2} + 1/2 + 1/\sqrt{6} + 2/\sqrt{3}$, we have the variable connectivity index given as a function of two variables:

$$^1\chi^f(x, y) = \frac{1}{\{(1+x)(2+x)\}^{1/2}} + \frac{1}{\{(2+x)(2+y)\}^{1/2}} + \frac{1}{\{(3+x)(2+y)\}^{1/2}} + \frac{2}{\{(1+x)(3+x)\}^{1/2}}$$

In Table 3 we list selected values of the variable molecular descriptor $^1\chi^f$ for ethyl isopropyl sulfide. As we see, the flexible descriptor is sensitive to the choice of the values for x and y. For a fixed value of x (carbon atom), as y decreases and approaches -1, the magnitude of the modified connectivity index increases. Similarly, for a fixed value of y, as x increases the magnitude of the modified connectivity index decreases. An increase or a decrease of the modified index is not as important as the change in the relative magnitudes of the indices for different molecules.
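The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the names `ATOM_TYPES`, `BONDS`, `modified_adjacency`, and `variable_connectivity` are hypothetical, and the bond list is written out from the atom numbering of Table 2 (atom 3 is sulfur). The weights x and y go on the diagonal of the adjacency matrix, and each bond contributes one over the square root of the product of the two row sums.

```python
from math import sqrt

# Atom numbering as in Table 2: atoms 1, 2, 4, 5, 6 are carbon, atom 3 is sulfur.
ATOM_TYPES = ["C", "C", "S", "C", "C", "C"]
BONDS = [(1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]

def modified_adjacency(atom_types, bonds, weights):
    """Adjacency matrix with the variable weight of each atom on the diagonal."""
    n = len(atom_types)
    matrix = [[0.0] * n for _ in range(n)]
    for i, kind in enumerate(atom_types):
        matrix[i][i] = weights[kind]
    for p, q in bonds:
        matrix[p - 1][q - 1] = matrix[q - 1][p - 1] = 1.0
    return matrix

def variable_connectivity(atom_types, bonds, weights):
    """Sum of 1/sqrt(row_sum_p * row_sum_q) over all bonds (p, q)."""
    matrix = modified_adjacency(atom_types, bonds, weights)
    row_sums = [sum(row) for row in matrix]
    return sum(1.0 / sqrt(row_sums[p - 1] * row_sums[q - 1]) for p, q in bonds)

def chi(x, y):
    """Variable index of ethyl isopropyl sulfide for carbon weight x and sulfur weight y."""
    return variable_connectivity(ATOM_TYPES, BONDS, {"C": x, "S": y})

for x, y in [(0.0, 0.0), (0.0, -1.0), (0.25, -0.95)]:
    print(f"x = {x:+.2f}, y = {y:+.2f}: index = {chi(x, y):.5f}")
```

With x = y = 0 the function reduces to the simple connectivity index, and the three parameter pairs printed here can be compared against the corresponding entries of Table 3.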

In Table 4 we list the expressions for the modified connectivity indices for the set of n = 21 sulfides. To illustrate the flexibility of these generalized connectivity indices, Table 5 lists the numerical values of the variable connectivity indices for selected values of x and y. Even though for most of the structures the changes have not reversed the relative magnitudes, they altered the magnitudes of the indices for different molecules sufficiently to influence the quality of the regression dramatically. The ratios of the magnitudes of descriptors for different molecules are important for MRA, and these do change. Consider isopropyl propyl sulfide (14) and ethyl isobutyl sulfide (15), with boiling points of 132.0 and 134.2 °C, respectively. As we can see from Table 5, when x = -1/2 and y = -1 the modified connectivity indices are 5.05917 and 5.09295, giving the quotient 0.9934. However, when x = +1/2 and y = -1 the modified connectivity indices are 2.95625 and 2.99224, and the quotient decreases to 0.9880. These changes may appear small; however, they are sufficient to influence the standard error and make one alternative better than the other. When such changes are summed over all molecules, considerable improvement in the overall standard error is possible.
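The quotients quoted above can be checked with a short sketch. The function names `term`, `chi_14`, and `chi_15` are hypothetical, and the bond terms are an assumption in the sense that they are written out here from the structures named in the text, (CH3)2CH-S-CH2CH2CH3 and CH3CH2-S-CH2CH(CH3)2, rather than copied from a table.

```python
from math import sqrt

def term(u, v):
    """Contribution 1/sqrt(u*v) of one bond whose end atoms have variable valences u and v."""
    return 1.0 / sqrt(u * v)

def chi_14(x, y):
    """Isopropyl propyl sulfide, (CH3)2CH-S-CH2CH2CH3."""
    return (2 * term(1 + x, 3 + x)    # two CH3-CH bonds
            + term(3 + x, 2 + y)      # CH-S
            + term(2 + x, 2 + y)      # S-CH2
            + term(2 + x, 2 + x)      # CH2-CH2
            + term(1 + x, 2 + x))     # CH2-CH3

def chi_15(x, y):
    """Ethyl isobutyl sulfide, CH3CH2-S-CH2CH(CH3)2."""
    return (term(1 + x, 2 + x)        # CH3-CH2
            + 2 * term(2 + x, 2 + y)  # two CH2-S bonds
            + term(2 + x, 3 + x)      # CH2-CH
            + 2 * term(1 + x, 3 + x)) # two CH-CH3 bonds

for x, y in [(-0.5, -1.0), (0.5, -1.0)]:
    a, b = chi_14(x, y), chi_15(x, y)
    print(f"x = {x:+.1f}, y = {y:+.1f}: {a:.5f} / {b:.5f} = {a / b:.4f}")
```

The point of the exercise is that the quotient, not the absolute size of either index, is what shifts when the weights are varied.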

In Table 6 we show the standard error as a function of the parameters x, y, assuming a quadratic regression using n = 19 compounds. We excluded two structures, ethyl butyl sulfide (12) and diisopropyl sulfide (20), to be discussed later. Using the simple connectivity index, the (0, 0) point in Table 6, the standard error is a quite respectable 2.71 °C. Nevertheless, this is about twice the magnitude of typical experimental errors reported for boiling points of organic compounds (1-1.5 °C). By keeping x constant and varying y, we see a dramatic reduction of the standard error as we approach the y = -1 limit. The standard error for x = 0 and y = -1 is about 1.5 °C smaller than the initial value (x = y = 0). With a further change of both parameters x and y, we find the minimum standard error of 1.326 °C (when x = +0.25 and y = -0.95). This is less than half of the initial standard error characterizing the “inflexible” connectivity index.
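The search over (x, y) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the bond classes in `BONDS` are written out by hand for the 19 retained sulfides (each tuple is a count and the two end atoms, with an integer for a carbon of that degree and `S` for sulfur), the boiling points are those of Table 7, the quadratic fit is an ordinary least-squares fit solved through the normal equations, and the small grid is arbitrary. All names are hypothetical.

```python
from math import sqrt

S = "S"  # sulfur; integers are carbon degrees (number of neighbours)
BONDS = {  # compound number -> [(count, end atom, end atom), ...]; 12 and 20 are omitted
    1: [(2, 1, S)],
    2: [(1, 1, 2), (1, 2, S), (1, 1, S)],
    3: [(1, 1, 2), (1, 2, 2), (1, 2, S), (1, 1, S)],
    4: [(2, 1, 2), (2, 2, S)],
    5: [(2, 1, 3), (1, 3, S), (1, 1, S)],
    6: [(2, 1, 3), (1, 3, S), (1, 2, S), (1, 1, 2)],
    7: [(1, 1, 2), (2, 2, 2), (1, 2, S), (1, 1, S)],
    8: [(2, 1, 3), (1, 2, 3), (1, 2, S), (1, 1, S)],
    9: [(2, 1, 2), (1, 2, 2), (2, 2, S)],
    10: [(3, 1, 4), (1, 4, S), (1, 1, S)],
    11: [(1, 1, 2), (3, 2, 2), (1, 2, S), (1, 1, S)],
    13: [(2, 1, 2), (2, 2, 2), (2, 2, S)],
    14: [(2, 1, 3), (1, 2, 2), (1, 3, S), (1, 2, S), (1, 1, 2)],
    15: [(2, 1, 3), (1, 2, 3), (2, 2, S), (1, 1, 2)],
    16: [(2, 1, 3), (1, 2, 3), (1, 2, 2), (1, 2, S), (1, 1, S)],
    17: [(1, 1, 2), (1, 1, 3), (2, 2, 3), (1, 2, S), (1, 1, S)],
    18: [(2, 1, 2), (1, 1, 3), (1, 2, 3), (1, 3, S), (1, 2, S)],
    19: [(1, 1, 2), (3, 1, 4), (1, 2, S), (1, 4, S)],
    21: [(2, 1, 2), (2, 2, 3), (1, 3, S), (1, 1, S)],
}
BP = {1: 37.3, 2: 66.6, 3: 95.5, 4: 92.0, 5: 84.4, 6: 107.4, 7: 123.2,
      8: 112.5, 9: 118.5, 10: 101.5, 11: 145.0, 13: 142.8, 14: 132.0,
      15: 134.2, 16: 137.0, 17: 139.0, 18: 133.6, 19: 120.4, 21: 137.0}

def valence(atom, x, y):
    """Variable valence: 2 + y for sulfur, degree + x for carbon."""
    return 2 + y if atom == S else atom + x

def chi(bonds, x, y):
    return sum(c / sqrt(valence(p, x, y) * valence(q, x, y)) for c, p, q in bonds)

def fit_quadratic(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*x + c2*x**2."""
    s = [sum(v ** k for v in xs) for k in range(5)]
    t = [sum(w * v ** k for v, w in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def standard_error(x, y):
    """Standard error of the quadratic regression of BP on the index for weights (x, y)."""
    keys = sorted(BONDS)
    xs = [chi(BONDS[k], x, y) for k in keys]
    ys = [BP[k] for k in keys]
    c0, c1, c2 = fit_quadratic(xs, ys)
    sse = sum((b - (c0 + c1 * v + c2 * v * v)) ** 2 for v, b in zip(xs, ys))
    return sqrt(sse / (len(keys) - 3))

grid = [(x, y) for x in (0.0, 0.25, 0.5) for y in (0.0, -0.5, -0.9, -0.95, -1.0)]
for x, y in sorted(grid, key=lambda p: standard_error(*p))[:5]:
    print(f"x = {x:+.2f}, y = {y:+.2f}: s = {standard_error(x, y):.3f}")
```

A finer grid or a numerical minimizer could replace the fixed grid; the fixed grid is kept here only because it mirrors the tabulated search.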

OUTLIERS 

Mathematical descriptors, if correctly calculated, are error-free. Hence, if in a correlation between an experimental quantity and mathematical descriptors one or more points show a larger deviation from the regression curve, this can mean one of two things: either (1) some of the experimental data used are in error or (2) the descriptors used fail to capture some relevant structural feature present in some molecules (and absent in others).

Whatever the reason for the departure of a point from the regression line, one can consider such a point an outlier if its departure from the correlation is more than twice the standard error. In Figure 3 we show the quadratic correlation for the sulfides, and in Table 7 we list the computed boiling points and the residuals. As we see from Table 7, ethyl butyl sulfide and diisopropyl sulfide show large departures from the regression. Table 8 gives the regression equations and the associated statistical parameters for all n = 21 sulfides as well as for the n = 19 sulfides that remain when the two outliers are removed from the set considered.

By  eliminating  the  apparent  outliers  (12  and  20),  one 
substantially  reduces  the  standard  error  for  the  quadratic 
model,  as  can  be  seen  from  the  bottom  part  of  Table  8.  The 
standard  error  for  the  regression  when  n  =  19  reaches  the 
respectable  value  of  1.33  °C  and  the  correlation  coefficient 
and  the  Fisher  ratio  have  increased.  This  signals  that  the 
model  has  improved  and  that  we  were  justified  in  eliminating 
the  two  outliers. 
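As a quick consistency check on the quoted standard errors, the sketch below recomputes s from the tabulated residuals. It assumes the usual definition s = sqrt(SSE/(n - 3)) for a quadratic model with three fitted coefficients; the residual lists are copied from Table 7 (n = 21) and Table 9 (n = 19), and the name `standard_error` is hypothetical.

```python
from math import sqrt

def standard_error(residuals, n_params=3):
    """s = sqrt(SSE / (n - n_params)); a quadratic model has three fitted coefficients."""
    sse = sum(r * r for r in residuals)
    return sqrt(sse / (len(residuals) - n_params))

# Residuals of the quadratic regression on all 21 sulfides (Table 7).
RES_21 = [-1.14, 1.07, 0.64, 1.58, -0.41, -0.61, 2.11, -1.64, 1.36, 1.46, 0.79,
          3.45, 2.05, -0.73, -0.32, -1.12, -0.40, -0.45, -1.42, -4.31, -1.94]

# Residuals after removing compounds 12 and 20 and refitting (Table 9).
RES_19 = [-0.68, 0.94, 0.23, 1.18, -0.77, -0.98, 1.90, -1.95, 1.07, 1.06, 1.24,
          2.36, -0.68, -0.22, -0.90, -0.14, -0.36, -1.62, -1.69]

print(f"n = 21: s = {standard_error(RES_21):.2f}")
print(f"n = 19: s = {standard_error(RES_19):.2f}")
```

Because the tabulated residuals are rounded to two decimals, agreement with the values of Table 8 can only be expected to about the second decimal place.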

In Table 9 we list the optimal connectivity indices for the sulfides considered, the experimental boiling points (BP), the calculated boiling points (BPcalc), the residuals of the regression (Res), the cross-validated boiling points (xBPcalc), and the standard error associated with cross-validation (when leaving one entry out). For the two outliers, ethyl butyl sulfide and diisopropyl sulfide, which were excluded when the regression equation was derived, we calculate boiling points of 140.44 and 124.47 °C, respectively. The first of these values is about 4 °C below the reported experimental BP; the second value is almost 4.5 °C higher than the reported experimental BP. The quadratic regression without the data on the two outliers is illustrated in Figure 4.
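The two predicted values can be reproduced directly from the n = 19 quadratic model. In the sketch below the coefficients are copied from Table 8 and the index values of compounds 12 and 20 from Table 9; the names `predicted_bp` and `OUTLIERS` are hypothetical.

```python
# Coefficients of the n = 19 quadratic model (Table 8): BP = A1*chi + A2*chi**2 + A0.
A1, A2, A0 = 108.9647, -9.0423, -124.6847

def predicted_bp(chi):
    """Boiling point (deg C) predicted from the optimal index (x = +0.25, y = -0.95)."""
    return A1 * chi + A2 * chi ** 2 + A0

# name -> (optimal index from Table 9, experimental boiling point in deg C)
OUTLIERS = {
    "ethyl butyl sulfide (12)": (3.38266, 144.2),
    "diisopropyl sulfide (20)": (3.06722, 120.0),
}

for name, (chi, bp_exp) in OUTLIERS.items():
    bp_calc = predicted_bp(chi)
    print(f"{name}: BPcalc = {bp_calc:.2f}, BP - BPcalc = {bp_exp - bp_calc:+.2f}")
```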

A closer look at the last column of Table 9, which lists the standard errors associated with the cross-validated regressions, shows (with a single exception, 13, dipropyl sulfide) that the cross-validated standard errors differ by about ±0.05 °C from the standard error of the regression (when all n = 19 compounds are considered). Hence, disregarding the exception, which produced a significantly smaller standard error, the constancy of the cross-validated standard errors shows the robustness of this particular regression.
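A leave-one-out procedure of the kind described can be sketched as follows. This is an assumption about the details rather than a record of the authors' calculation: each compound is removed in turn, the quadratic model is refitted by ordinary least squares to the remaining 18 points, the omitted boiling point is predicted, and the standard error of the refitted model is reported with n - 3 degrees of freedom. The index values and boiling points are copied from Table 9; all function names are hypothetical.

```python
from math import sqrt

# (compound number, optimal index, experimental BP in deg C) from Table 9,
# without the two outliers 12 and 20.
DATA = [(1, 1.74575, 37.3), (2, 2.11976, 66.6), (3, 2.56420, 95.5),
        (4, 2.49337, 92.0), (5, 2.40648, 84.4), (6, 2.78049, 107.4),
        (7, 3.00865, 123.2), (8, 2.88555, 112.5), (9, 2.93821, 118.5),
        (10, 2.64784, 101.5), (11, 3.45309, 145.0), (13, 3.38266, 142.8),
        (14, 3.22494, 132.0), (15, 3.25956, 134.2), (16, 3.32999, 137.0),
        (17, 3.35550, 139.0), (18, 3.25044, 133.6), (19, 3.02185, 120.4),
        (21, 3.34637, 137.0)]

def fit_quadratic(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*x + c2*x**2."""
    s = [sum(v ** k for v in xs) for k in range(5)]
    t = [sum(w * v ** k for v, w in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def std_error(xs, ys, coef):
    c0, c1, c2 = coef
    sse = sum((y - (c0 + c1 * x + c2 * x * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 3))

def leave_one_out(data):
    """Map compound -> (prediction for the omitted compound, standard error of the refit)."""
    results = {}
    for i, (label, x_out, _) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        xs = [d[1] for d in rest]
        ys = [d[2] for d in rest]
        coef = fit_quadratic(xs, ys)
        pred = coef[0] + coef[1] * x_out + coef[2] * x_out ** 2
        results[label] = (pred, std_error(xs, ys, coef))
    return results

loo = leave_one_out(DATA)
for label, (pred, s) in loo.items():
    print(f"{label:2d}  xBPcalc = {pred:7.2f}  xstd error = {s:.2f}")
```

If the refitted standard errors stay close to that of the full fit, no single compound is driving the regression, which is the sense of robustness used above.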
Table 4. Generalized Flexible Connectivity Indices for n = 21 Sulfides (of Figure 1)

 1   2/{(1 + x)(2 + y)}^{1/2}
 2   1/{(1 + x)(2 + x)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
 3   1/{(1 + x)(2 + x)}^{1/2} + 1/(2 + x) + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
 4   2/{(1 + x)(2 + x)}^{1/2} + 2/{(2 + x)(2 + y)}^{1/2}
 5   2/{(1 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
 6   2/{(1 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + y)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + x)}^{1/2}
 7   1/{(1 + x)(2 + x)}^{1/2} + 2/(2 + x) + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
 8   2/{(1 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + x)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
 9   2/{(1 + x)(2 + x)}^{1/2} + 1/(2 + x) + 2/{(2 + x)(2 + y)}^{1/2}
10   3/{(1 + x)(4 + x)}^{1/2} + 1/{(4 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
11   1/{(1 + x)(2 + x)}^{1/2} + 3/(2 + x) + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
12   2/{(1 + x)(2 + x)}^{1/2} + 2/(2 + x) + 2/{(2 + x)(2 + y)}^{1/2}
13   2/{(1 + x)(2 + x)}^{1/2} + 2/(2 + x) + 2/{(2 + x)(2 + y)}^{1/2}
14   2/{(1 + x)(3 + x)}^{1/2} + 1/(2 + x) + 1/{(3 + x)(2 + y)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + x)}^{1/2}
15   2/{(1 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + x)}^{1/2} + 2/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + x)}^{1/2}
16   2/{(1 + x)(3 + x)}^{1/2} + 1/(2 + x) + 1/{(3 + x)(2 + x)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
17   1/{(1 + x)(2 + x)}^{1/2} + 1/{(1 + x)(3 + x)}^{1/2} + 2/{(2 + x)(3 + x)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}
18   2/{(1 + x)(2 + x)}^{1/2} + 1/{(1 + x)(3 + x)}^{1/2} + 1/{(2 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + y)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2}
19   1/{(1 + x)(2 + x)}^{1/2} + 3/{(1 + x)(4 + x)}^{1/2} + 1/{(2 + x)(2 + y)}^{1/2} + 1/{(4 + x)(2 + y)}^{1/2}
20   4/{(1 + x)(3 + x)}^{1/2} + 2/{(3 + x)(2 + y)}^{1/2}
21   2/{(1 + x)(2 + x)}^{1/2} + 2/{(2 + x)(3 + x)}^{1/2} + 1/{(3 + x)(2 + y)}^{1/2} + 1/{(1 + x)(2 + y)}^{1/2}


Table 5. Modified Connectivity Index $^1\chi^f$ for the Sulfides for a Selection of Choices of (x, y)

       (0, 0)    (0, -0.5)   (0, -1)   (-0.5, -1)   (+0.5, -1)   (+1, -1)
  1    1.41421   1.63299     2.00000   2.82843      1.63299      1.41421
  2    1.91421   2.10095     2.41421   3.38541      1.96535      1.69271
  3    2.41421   2.60095     2.91421   4.05208      2.36535      2.02604
  4    2.41421   2.56891     2.82823   3.94239      2.29771      1.97120
  5    2.27006   2.44260     2.73205   3.83552      2.22389      1.91421
  6    2.77006   2.91056     3.14626   4.39251      2.55625      2.19271
  7    2.91421   3.10095     3.41421   4.71874      2.76535      2.35937
  8    2.77006   2.95680     3.27006   4.53596      2.65989      2.28924
  9    2.91421   3.06891     3.32843   4.60906      2.69771      2.30453
 10    2.56066   2.72474     3.00000   4.21652      2.44260      2.10300
 11    3.41421   3.60096     3.91421   5.38541      3.16535      2.69271
 12    3.41421   3.56891     3.82843   5.27537      3.09771      2.63786
 13    3.41421   3.56891     3.82843   5.27537      3.09771      2.63786
 14    3.27006   3.41056     3.64626   5.05917      2.95625      2.52604
 15    3.27006   3.42476     3.68427   5.09295      2.99224      2.55873
 16    3.27006   3.45680     3.77006   5.20263      3.05989      2.61357
 17    3.30806   3.49480     3.80806   5.31263      3.07791      2.62361
 18    3.30806   3.44857     3.68427   5.16918      2.97427      2.53608
 19    3.06067   3.19271     3.41421   4.77351      2.77496      2.38150
 20    3.12590   3.25221     3.46410   4.84262      2.81479      2.41421
 21    3.34607   3.51861     3.80806   5.38887      3.05994      2.60095

Table 6. Standard Error of the Regression for Different Choices of the Variable Parameters x and y

   y \ x     -0.50     0              +0.25     +0.50     +1        +2
  +0.50                [illegible]
   0                   2.711
  -0.25                2.363
  -0.50                1.966
  -0.75                1.558
  -0.90                1.382          1.347
  -0.95                1.356          1.326     1.380
  -1         2.256     1.357          1.327     1.327     1.570     2.042
  -1.2                 1.720

We believe that it may be possible to improve the regression further. A close inspection of the residuals shows that, with very few exceptions, all linear structures have positive residuals, while all branched structures have negative residuals. This suggests the possibility of a further reduction of the standard error (particularly if the exceptions are viewed as outliers). However, such refinements should be attempted only when a larger set of compounds is considered, in order to see whether the observed trend is genuine.

Finally,  as  a  warning,  we  should  add  that  when  using 
flexible  descriptors,  elimination  of  outliers  may  influence 


Figure 3. Quadratic regression for the boiling points of n = 21 sulfides against the optimal connectivity index (x = +0.25, y = -0.95).


Table 7. Calculated Boiling Points (BPcalc) and the Residual of the Regression (Res), When All n = 21 Sulfides Are Considered

        BP       BPcalc    Res
  1     37.3     38.44     -1.14
  2     66.6     65.53     +1.07
  3     95.5     94.86     +0.64
  4     92.0     90.42     +1.58
  5     84.4     84.81     -0.41
  6     107.4    108.01    -0.61
  7     123.2    121.09    +2.11
  8     112.5    114.14    -1.64
  9     118.5    117.14    +1.36
 10     101.5    100.04    +1.46
 11     145.0    144.21    +0.79
 12     144.2    140.75    +3.45
 13     142.8    140.75    +2.05
 14     132.0    132.73    -0.73
 15     134.2    134.52    -0.32
 16     137.0    138.12    -1.12
 17     139.0    139.40    -0.40
 18     133.6    134.05    -0.45
 19     120.4    121.82    -1.42
 20     120.0    124.31    -4.31
 21     137.0    138.94    -1.94

the  optimal  values  for  the  parameters  x,  y,  though  not 
necessarily  dramatically. 


CONCLUDING  REMARKS 

Several  criticisms  could  be  raised  concerning  the  outlined 
work:37  Is  it  appropriate  to  refer  to  MRA  using  flexible 


904 J. Chem. Inf. Comput. Sci., Vol. 40, No. 4, 2000

Randić and Basak


Table 8. Linear and Quadratic Regressions for Sulfides(a)

  n    model        coeff $\chi$   coeff $\chi^2$   constant     r        s      F
 21    linear        60.1981                         -61.3339    0.9959   2.61   2291
 21    quadratic    102.8180       -7.8615          -117.0919    0.9981   1.83   2328
 21    orthogonal    60.1981       -7.8615           -61.3339    0.9981   1.83   2328
 19    linear        60.1057                         -60.9916    0.9961   2.59   2180
 19    quadratic    108.9647       -9.0423          -124.6847    0.9990   1.33   4192
 19    orthogonal    60.1057       -9.0423           -60.9916    0.9990   1.33   4192

(a) The top part gives the regression equations and the statistical parameters for all n = 21 sulfides; the bottom part gives the equations when the two outliers are excluded.


Table 9. Optimal Connectivity Indices for the Sulfides Considered, the Experimental Boiling Points (BP), the Calculated Boiling Points (BPcalc), the Residual of the Regression (Res), the Cross-Validated Boiling Points (xBPcalc), and the Standard Error of Cross-Validated Boiling Points

       (+0.25, -0.95)   BP       BPcalc    Res      xBPcalc   xstd error
  1    1.74575          37.3     37.98     -0.68    40.65     1.31
  2    2.11976          66.6     65.66     +0.94    65.41     1.34
  3    2.56420          95.5     95.27     +0.23    95.23     1.37
  4    2.49337          92.0     90.82     +1.18    90.60     1.33
  5    2.40648          84.4     85.17     -0.77    85.30     1.35
  6    2.78049          107.4    108.38    -0.98    108.51    1.34
  7    3.00865          123.2    121.30    +1.90    120.86    1.27
  8    2.88555          112.5    114.45    -1.95    114.66    1.26
  9    2.93821          118.5    117.41    +1.07    117.31    1.34
 10    2.64784          101.5    100.44    +1.06    100.28    1.38
 11    3.45309          145.0    143.76    +1.24    143.42    1.32
 12    3.38266          144.2
 13    3.38266          142.8    140.44    +2.36    138.86    1.20
 14    3.22494          132.0    132.68    -0.68    132.74    1.38
 15    3.25956          134.2    134.42    -0.22    134.44    1.37
 16    3.32999          137.0    137.90    -0.90    138.01    1.35
 17    3.35550          139.0    139.14    -0.14    139.16    1.37
 18    3.25044          133.6    133.96    -0.36    134.00    1.37
 19    3.02185          120.4    122.02    -1.62    122.16    1.30
 20    3.06722          120.0
 21    3.34637          137.0    138.69    -1.69    138.94    1.29

Figure 4. Quadratic regression for the boiling points of n = 19 sulfides against the optimal connectivity index (x = +0.25, y = -0.95). Outliers 12 and 20 are excluded.

descriptors as “high-quality regression”, or should it be called “high specialty SAR”? Is one justified in arriving at a low standard error by “trimming the data set and by tweaking the descriptor”? Would the model be any good at predicting boiling points even for other sulfides? Is the approach general enough and sufficiently justified if we were to use QSAR models for real-world problems? Why not consider a more extensive study on a larger set of data to strengthen the case? What is the use of a model developed by considering a quite small, homogeneous set of compounds? Is developing a fit with a standard error less than the experimental error (if that can be achieved) overfitting?


We respond to these questions one by one. Variable connectivity indices (and related variable indices) constitute a general class of descriptors, as compared to the special classes of descriptors used in QSAR (e.g., indicator variables used in some QSAR, or hydrogen bonding descriptors used in CODESSA), for which the attribute “high specialty” holds. Concerning the problem of identifying outliers, these are well-defined as points that lie beyond 2 standard deviations. There are no good reasons for their inclusion in the data set, even though their departure from the regression need not be due to experimental error. Most often it is not. The occurrence of outliers may be a signal that the set of descriptors used to characterize molecules failed to capture some special structural features which are important for the outliers but not for most of the other molecules in the set. A close look at outliers may help one to recognize such features, if they are not obvious. For example, in the correlation of the boiling points of smaller alkanes25 only 2,2,3,3-tetramethylbutane was identified as an outlier (with a deviation of over 6 °C), while the standard error was 2.48 °C. By removing this compound, the standard error dropped to 2 °C. Hence, a single compound in a set of 20 was able to increase the standard error by almost 1/2 °C. Why should this compound, which has additional structural features (significant overcrowding of methyl groups and a quaternary CC bond) absent in the rest, be included if one is interested in predicting the boiling point of a compound which has no overcrowded methyl groups and no quaternary CC bond?

The smaller sulfides considered (and the same has been the case with smaller alkanes or amino acids) are molecules of similar size. Considering a large selection of compounds necessarily brings the dominant role of molecular size into focus as an important feature. Before we do this, we should investigate to what extent the variable weights may depend on the size of the molecule. At the moment this is an unresolved problem, which is the main reason for restricting attention to smaller sets of compounds of similar size. We should add that it is not uncommon in QSAR to consider smaller sets of compounds, often because of limited data. For example, in a recent review of comparative QSAR, Hansch and co-workers38,39 gave results for 189 regressions, of which only 33 had more than 20 compounds in the set and 156 had fewer than 20 compounds, that is, fewer than the number of sulfides considered in this paper. If the compounds are well-selected, the resulting regressions may be of interest. We gave here the results for smaller sulfides. If one is interested in larger sulfides, one should select those, and if one is interested in all sulfides, one should combine them all. But again a question can be raised: If one is interested in predicting the boiling point of smaller sulfides, why does one need information on compounds that are twice their size? It is a matter of philosophy, and while we appreciate the merits of studying a large database, we also appreciate the advantages of studying small homogeneous sets of compounds. Such a study focuses attention on different aspects of structural chemistry. In fact, one of the present authors has made numerous studies on large sets of compounds using diverse types of molecular descriptors.40-45

Concerning “overfitting”, which is clearly undesirable, we would like to point out that this is out of the question when one uses a single descriptor. Overfitting is a danger in multiple regression analysis when one uses too many descriptors and has too few data. One cannot have overfitting with a single descriptor. This problem has received some attention.46 Does the variation of descriptors during the regression pose such a threat? Definitely so, just as the selection of descriptors from a large pool of descriptors (e.g., in the CODESSA software) does. The difference between the two is that when using a variable connectivity index one typically generates about 40 different numerical alternative descriptors to choose from, whereas CODESSA typically chooses a half-dozen descriptors from a pool of some 400 descriptors!

Finally, we have to emphasize that while the idea of modifying chemical graph descriptors to differentiate heteroatoms is not new, as is well-illustrated by the pioneering work of Kier and Hall on valence connectivity indices,28 the idea of modifying chemical graph descriptors to differentiate heteroatoms during the search for the best regression, that is, the idea of variable topological indices, is new.
We consider the construction of optimal molecular descriptors to be used for multiple regression analysis of several properties of alcohols. The descriptors are obtained by considering shorter paths with a variable weight x for the carbon-oxygen bond in an alcohol. In particular, we consider as molecular descriptors paths of length 1, 2, and 3. The multiple regression analysis of the following molecular properties was examined: -log S (S = solubility), CSA (cavity surface area), log P (P = octanol/water partition), and log γ (γ = infinite dilution activity coefficient). By minimizing the standard error of the regression for each property we found the optimal variable weight.

Keywords: Variable molecular descriptors; weighted paths; MRA; orthogonal descriptors; alcohol properties


INTRODUCTION 

The study of structure-property and structure-activity relationships continues to attract considerable attention in the chemical literature. Various statistical methods have been found useful in such studies, including Principal Component Analysis (PCA) [1], Pattern Recognition (PR) [2], the Partial Least Squares method (PLS) [3], and Artificial Neural Networks (ANN) [4]. The oldest data reduction method, Multiple Regression Analysis (MRA) [5], continues to be widely used. Most applications of MRA to QSAR and SAR can be classified into one of two types:

(I)  Examination of a large number of diverse and heterogeneous structures;

(II)  Study of a smaller number of homogeneous structures.

Each of these types of study has its merits and will continue to be pursued. In both cases one often starts by screening a large pool of molecular descriptors, from which one selects a smaller number of descriptors that are used for the construction of regression equations or of principal components. An alternative, particularly suitable when one studies a smaller number of structurally related compounds, is to focus attention on only a few molecular descriptors which are general enough to be used in different applications [6, 7]. Such descriptors were referred to as basis descriptors, in analogy with basis vectors in linear algebra. The advantage of basis descriptors is that they facilitate comparative analysis, because the same descriptors are used in different applications, for different molecules and different properties. For example, Kier and Hall [8] used different combinations of the connectivity indices for the best correlation of alkane heats of atomization and alkane heats of formation. If, however, one restricts the search for the best correlation for the two properties to the same connectivity indices, one finds that the two properties are strictly collinear, a fact that is obscured when one uses different descriptors because the two samples of structures are somewhat different.

Despite its wide use, MRA was viewed by some as deficient because, as a rule,
the introduction of an additional descriptor in the analysis causes dramatic
changes in the contributions of the descriptors already used. Because of this
pronounced instability of the regression equations it is not possible to interpret
the results in terms of the relative role of the descriptors used. This deficiency
(which incidentally is not confined solely to MRA) has been traced to the mutual
interrelation of descriptors [9-13]. If the descriptors used are to a great extent
independent of one another, one observes only minor variations of the coefficients
of the regression equation when a descriptor is included or excluded. However, the
use of moderately and highly intercorrelated descriptors, which often cannot be
avoided, results in pronounced instability of the regression equation. This is
particularly visible when one introduces descriptors one at a time in a stepwise
regression.

This very unsatisfactory state of affairs has been tolerated because, despite the
instability of the regression equations, each additional relevant descriptor
decreases the standard error of prediction for the property considered. Thus the
equation offers useful predictions but it does not offer a useful interpretation.
This MRA nightmare, as some have referred to it, is no more. With the introduction
of an orthogonalization procedure for molecular descriptors, not only does the
regression equation become stable, but the error of the coefficients is reduced
with the introduction of each additional relevant descriptor [12]. While some have
recognized the significance of using orthogonal molecular descriptors [14-16],
others apparently still do not appreciate, or are unaware of, the novel situation,
which for the first time makes it possible to interpret the relative contributions
of the descriptors used.

We will refer to MRA using orthogonalized molecular descriptors as MORA, the
Multivariate Orthogonal Regression Analysis. It has been shown that MORA and
MRA remain related, so that one can obtain the orthogonalized regression equation
from MRA by stepwise regression [9, 10]. With this remedy MRA not only
remains a very viable data reduction method for QSAR and QSPR, but in some
ways may again become the method of choice, despite the fact that researchers
in the field are free to be reluctant to use a new method! In our opinion MORA
has an important advantage over PCA. MORA, just as PCA, uses orthogonal
descriptors, but in contrast to PCA the descriptors used in MORA can be
interpreted in terms of the structural meaning of the initial descriptors. The
linear combinations that define the principal components have, at best, a
vague interpretation (e.g., as bulk, cohesiveness, etc.). Not only is it hard
to visualize what such linear combinations of descriptors represent; the
descriptors that define the principal components are themselves not orthogonal,
even though the principal components are mutually orthogonal. So, as far as the
interpretation of the results of PCA is concerned, we are in no better situation
than we were with MRA in the time of instabilities of the regression equations!
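
The stabilizing effect of orthogonalization can be illustrated with a minimal sketch. It assumes that each descriptor after the first is replaced by its residual from a regression on the descriptors already accepted (a Gram-Schmidt style construction); the function names and the toy data are illustrative assumptions and are not taken from refs. [9-13].

```python
def center(v):
    """Subtract the mean, which accounts for the constant term of a regression."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(descriptors):
    """Replace each descriptor by its residual against the earlier ones."""
    ortho = []
    for d in descriptors:
        r = center(d)
        for q in ortho:
            c = dot(r, q) / dot(q, q)
            r = [a - c * b for a, b in zip(r, q)]
        ortho.append(r)
    return ortho


def orthogonal_coefficients(y, ortho):
    """Each coefficient depends only on its own orthogonal descriptor."""
    yc = center(y)
    return [dot(yc, q) / dot(q, q) for q in ortho]


def ols_two(y, d1, d2):
    """Coefficients of d1 and d2 in the ordinary regression y ~ 1 + d1 + d2."""
    yc, u, v = center(y), center(d1), center(d2)
    det = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    b1 = (dot(u, yc) * dot(v, v) - dot(u, v) * dot(v, yc)) / det
    b2 = (dot(u, u) * dot(v, yc) - dot(u, v) * dot(u, yc)) / det
    return b1, b2


# Toy data (hypothetical): two strongly intercorrelated descriptors.
d1 = [1, 2, 3, 4, 5]
d2 = [1, 4, 9, 16, 25]
y = [3, 9, 19, 33, 51]

slope_alone = dot(center(y), center(d1)) / dot(center(d1), center(d1))
b_mra = ols_two(y, d1, d2)
omega = orthogonalize([d1, d2])
b_one = orthogonal_coefficients(y, omega[:1])
b_two = orthogonal_coefficients(y, omega)

print("d1 alone:", slope_alone)
print("d1, d2 together (ordinary MRA):", b_mra)
print("orthogonal, one descriptor:", b_one)
print("orthogonal, two descriptors:", b_two)
```

In this toy case the coefficient of d1 in the ordinary regression changes completely once d2 is added, whereas the coefficient of the first orthogonal descriptor is the same with one or two descriptors in the equation.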


OPTIMAL  MOLECULAR  DESCRIPTORS 

With hundreds of molecular descriptors available [17-19], one is immediately
confronted with a decision concerning the selection of descriptors. The choices to
consider are: (a) select a subset of “the best” descriptors from a large pool of
available descriptors; (b) use a limited set of “well ordered” structurally related
descriptors, the basis; (c) use as few descriptors as possible, suitably
optimized for the particular application. We will refer to the last alternative
as the use of optimal molecular descriptors. In the first case we put “the best”
in quotes because the outcome will depend on the criteria used to select
descriptors. The current practice, adopted by many, of excluding descriptors that
are highly intercorrelated with descriptors already selected has, as argued
elsewhere [20, 21], no theoretical justification. We also put “well ordered” in
quotes because the ordering of descriptors will influence the interpretation, even
though it will not influence the statistical parameters of the regression analysis.

Optimization of molecular descriptors is a relatively novel technique in QSAR
and SAR that has been for the most part overlooked. It is generally recognized
that the presence of heteroatoms in a molecule requires the use of additional
molecular descriptors. However, the additional descriptors to be used for C-X
bonds (X can be O, N, Cl, etc.) are usually prescribed in advance, using some
physicochemical analogy or data. For example, Kier and Hall introduced the
valence connectivity indices by assigning to atoms a valence parameter based on
the count of valence electrons of each atom [22]. Another possibility, perhaps
not so widely known, uses the covalent radii of carbon and other atoms in deriving
parameters to differentiate atoms of different kinds [23]. In contrast, one of the
present authors considered a variable weight as an entry on the main diagonal of
the adjacency matrix of a molecular graph. For example, for ethyl alcohol one
would have for the so generalized adjacency matrix:

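$$
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & y \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} x & 1 & 0 \\ 1 & x & 1 \\ 0 & 1 & y \end{pmatrix}
$$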

Here x and y represent variables describing the carbon and oxygen atoms,
respectively. Using x and y as variables one can construct the connectivity
indices (or connectivity weighted paths) and search for the values of x and y that
minimize the standard error in the regression analysis of the property of interest
[24]. For example, in the case of the boiling points of alcohols one finds that
x = 1.50 and y = -0.85 result in the smallest standard error. The use of diagonal
entries was considered some time ago in chemical documentation by Spialter,
who developed alphanumeric matrices for the representation of chemical structure
[25]. The difference, however, is that rather than using the symbols C and O
(corresponding to x and y), here we search for numerical parameters that result in
the best regression. In the case of the chlorine atom the diagonal entry y = -20 [26]
was found to give a better regression than approaches based on the “traditional”
molecular descriptors (i.e., the approaches following Hansch’s methodology [27]).
Similarly, in the case of nitrogen-containing molecules the diagonal entries
x = 1.25 for carbon and y = -0.65 for nitrogen give the optimal solution for the
boiling points of amines [28].
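
As a hedged illustration of how the diagonal entries enter a descriptor, the sketch below builds the generalized adjacency matrix of ethyl alcohol and evaluates a first-order connectivity index. It assumes that the row sums of the generalized matrix take the place of the vertex degrees in the usual 1/sqrt(d_i d_j) bond contribution; this assumption and the function names are ours and are not stated in the text above.

```python
from math import sqrt


def generalized_adjacency(n_atoms, bonds, diagonal):
    """Adjacency matrix with variable weights on the main diagonal."""
    a = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        a[i][j] = a[j][i] = 1.0
    for i, w in enumerate(diagonal):
        a[i][i] = w
    return a


def variable_connectivity(n_atoms, bonds, diagonal):
    """First-order index: sum over bonds of 1/sqrt(row_sum_i * row_sum_j)."""
    a = generalized_adjacency(n_atoms, bonds, diagonal)
    row_sums = [sum(row) for row in a]
    return sum(1.0 / sqrt(row_sums[i] * row_sums[j]) for i, j in bonds)


# Ethyl alcohol as the chain C-C-O: atoms 0 and 1 are carbon, atom 2 is oxygen.
ethanol_bonds = [(0, 1), (1, 2)]
x, y = 1.50, -0.85  # values quoted in the text for alcohol boiling points

matrix = generalized_adjacency(3, ethanol_bonds, [x, x, y])
index_plain = variable_connectivity(3, ethanol_bonds, [0.0, 0.0, 0.0])
index_weighted = variable_connectivity(3, ethanol_bonds, [x, x, y])

for row in matrix:
    print(row)
print(index_plain, index_weighted)
```

With x = y = 0 the function reduces to the ordinary connectivity index of the three-atom chain; in a regression study x and y would be varied until the standard error is minimal.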

All the above cases relate to connectivity indices and paths weighted using
the same weighting algorithm. However, variable descriptors can be constructed
for other topological indices besides the connectivity indices. The construction
of such variable generalizations of the Wiener index [29] and the Hosoya index
[30] has been recently outlined [31]. Recently variable weights have also been
considered for path numbers [32, 33]. In this article we continue the exploration
of optimally weighted path numbers for the characterization of molecules.


WEIGHTED PATH NUMBERS

Path numbers were suggested fifty years ago by Platt as potentially useful
molecular descriptors [34]. Apparently the contribution of Platt, despite its
importance, was overlooked until a revived interest in chemical graph theory
emerged in the mid 1970's. Through a series of papers [35-43] Randić
and Wilkins resurrected path numbers and illustrated the use of paths for the
characterization of molecules and their fragments. Later Randić and coworkers
[44-49] introduced weights for paths of different length by weighting the
contributions of bonds and longer paths, using $1/\sqrt{mn}$ as the weight for
the individual bonds involved. Weighted paths are also implied in the construction
of higher order connectivity indices [50]. All these cases, however, used a rigidly
prescribed weighting scheme which, once adopted, does not change.

As already mentioned, the use of the diagonal entries of the adjacency matrix
as variable input initiated the construction of a new kind of molecular descriptor.
In contrast to the hitherto used topological indices and other descriptors, the new
descriptors have an inherent flexibility that allows them to be constructed so as to
minimize the standard error in a regression. Very recently this kind of flexibility
associated with variable weights has been extended to the construction of weighted
molecular paths. This has led to the generalized Wiener number [32] and the
generalized path numbers [33] already mentioned. Formally the Wiener number can
be written as:

$$
W = 1\,p_1 + 2\,p_2 + 3\,p_3 + 4\,p_4 + \cdots + k\,p_k
$$

where $p_1, p_2, p_3, \ldots$ are the numbers of paths of length one, length two,
length three, etc. The above can be viewed as the dot product of the vectors
$L = (1, 2, 3, 4, \ldots, k)$ and $P = (p_1, p_2, p_3, \ldots, p_k)$. If one now
introduces vectors $L^m$ of the form $(1^m, 2^m, 3^m, \ldots, k^m)$, the dot product
W becomes a function of the exponent m, i.e., instead of W we now have W(m). Here
one treats m as a variable and, for example, in the case of alkanes the best
quadratic fit of motor octane numbers is obtained when m = -1.50, while the best
quadratic fit for the boiling points of alkanes is obtained when m = 1.90.
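
The definition of W(m) can be sketched in a few lines. The example assumes acyclic carbon skeletons, for which every pair of atoms is joined by exactly one path, so that path counts follow from graph distances; the function names and the two test molecules are illustrative choices.

```python
from collections import deque


def path_counts(bonds):
    """Return {length: number of paths} for an acyclic (tree) skeleton."""
    adj = {}
    for u, v in bonds:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for start in adj:
        # Breadth-first search gives the distance from start to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for d in dist.values():
            if d > 0:
                counts[d] = counts.get(d, 0) + 1
    # Every path was met once from each of its two ends.
    return {k: c // 2 for k, c in sorted(counts.items())}


def generalized_wiener(counts, m):
    """W(m) = sum over k of k**m * p_k; m = 1 gives the Wiener number."""
    return sum((k ** m) * p for k, p in counts.items())


butane = [(1, 2), (2, 3), (3, 4)]
isobutane = [(1, 2), (2, 3), (2, 4)]

p_butane = path_counts(butane)
p_isobutane = path_counts(isobutane)
print(p_butane, generalized_wiener(p_butane, 1))
print(p_isobutane, generalized_wiener(p_isobutane, 1))
print(generalized_wiener(p_butane, -1.5), generalized_wiener(p_isobutane, -1.5))
```

For m = 1 the function returns the ordinary Wiener numbers of the two skeletons, and other exponents shift the balance between short and long paths.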

Randić and Pompe [33] considered a different kind of weight for paths when
examining the molar refraction of unsaturated hydrocarbons. They associated the
weight x with the individual C=C bond in alkenes and assigned the weight x to all
paths that involve the C=C bond. This approach applies equally to the
characterization of heterobonds, as illustrated by Randić and Basak when revisiting
the correlation of the boiling points of alcohols [51]. In Table I we give the
enumeration of weighted paths for 3-methyl-1-butanol and 2-pentanol, which, if one
does not differentiate CC and CO bonds, would both give the same path counts 5, 5,
3, 2, instead of 4 + x, 4 + x, 2 + x, 2x and 4 + x, 3 + 2x, 2 + x, 1 + x,
respectively!

TABLE I  Weighted paths for 3-methyl-1-butanol and 2-pentanol

             3-Methyl-1-butanol                 2-Pentanol
atom         p1      p2      p3      p4         p1      p2      p3      p4
1            1 + x   1       2       -          1       1 + x   1       1
2            2       2 + x   -       -          2 + x   1       1       -
3            3       1       x       -          2       2 + x   -       -
4            1       2       1       x          2       1       1 + x   -
5            1       2       1       x          1       1       1       1 + x
6            x       x       x       2x         x       2x      x       x
Molecule:    4 + x   4 + x   2 + x   2x         4 + x   3 + 2x  2 + x   1 + x

Clearly  when  x  =  1  the  two  path  vectors  are  identical,  but  already  setting  x  =  1.1 
or  x  =  0.9  results  in  differentiation  between  the  two  isomers.  In  the  case  of  molar 
refraction  of  heptene  isomers  when  using  three  path  numbers  the  value  of  x  =  0.6 
leads  to  an  impressive  reduction  in  the  standard  error  (s  =  0.08). 
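
The bookkeeping behind such weighted path counts can be sketched as follows. The sketch assumes a single distinguished bond and writes each path number as a + b*x, where a counts the paths that avoid that bond and b the paths that contain it; the atom labels and function names are illustrative.

```python
def weighted_path_counts(bonds, hetero_bond, max_len=4):
    """Return {k: (a, b)}, meaning p_k = a + b*x, for a molecular graph.

    bonds       -- list of (atom, atom) pairs; atom labels must be sortable
    hetero_bond -- the single bond (e.g. C-O) whose paths receive weight x
    """
    adj = {}
    for u, v in bonds:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    hetero = frozenset(hetero_bond)
    counts = {k: [0, 0] for k in range(1, max_len + 1)}

    def walk(start, node, visited, length, has_hetero):
        if length == max_len:
            return
        for nxt in adj[node]:
            if nxt in visited:
                continue
            flag = has_hetero or frozenset((node, nxt)) == hetero
            if start < nxt:  # each path is met from both ends; keep one copy
                counts[length + 1][1 if flag else 0] += 1
            walk(start, nxt, visited | {nxt}, length + 1, flag)

    for atom in adj:
        walk(atom, atom, {atom}, 0, False)
    return {k: tuple(c) for k, c in counts.items()}


def evaluate(counts, x):
    """Numerical path numbers p_1, p_2, ... for a chosen weight x."""
    return [a + b * x for a, b in counts.values()]


methylbutanol = [("O", "C1"), ("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C3", "C5")]
pentan_2_ol = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "C5"), ("C2", "O")]

paths_mb = weighted_path_counts(methylbutanol, ("O", "C1"))
paths_2p = weighted_path_counts(pentan_2_ol, ("C2", "O"))
print(paths_mb)
print(paths_2p)
print(evaluate(paths_mb, 1.0), evaluate(paths_2p, 1.0))
print(evaluate(paths_mb, 1.1), evaluate(paths_2p, 1.1))
```

At x = 1 both isomers give the same vector, while any other weight separates them.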


REVIEW  OF  THE  EXPERIMENTAL  DATA  USED 

QSAR and SAR studies often point to a few experimental points that do not fit
the derived correlation well. Outliers so identified are then omitted from the
correlations with some justification, even though the source of the disagreement
is not known and need not be attributed to presumed experimental error. It is
possible that some outliers have unrecognized structural features, which the
descriptors used cannot adequately characterize, that make them exceptional.
Nevertheless, by being different from the other compounds under analysis, the
outliers may legitimately be eliminated from consideration. In our study, as will
be seen shortly, we were able to identify one such outlier even before starting
the regression analysis. Having several properties of alcohols available, we
decided first to review the property-property correlations of the alcohols to be
studied. This pointed to a discrepancy in the experimental data for 2-hexanol.

We have selected the following properties of alcohols: (a) water solubility
(-log S); (b) cavity surface area (CSA); (c) octanol-water partition (log P); and
(d) infinite dilution activity coefficient (ln γ). We have already examined the
boiling points of alcohols in ref. [51]. All these properties have recently been
studied by MRA using alternative molecular descriptors, by Cao and Li for -log S,
CSA, and log P [52], and by Mitchell and Jurs for ln γ [53]. A set of n = 50
alcohols was used when considering -log S and CSA, a set of n = 38 alcohols was
used in the log P study, and a set of n = 43 alcohols was used for the ln γ study.
In Table II we collect the experimental data for the subset of alcohols studied in
refs. [51-53].

TABLE II  Common experimental data for different sets of alcohols studied (including
boiling points studied in ref. [20])

Alcohol                 -log S    CSA      log P    ln γ     b.p. (°C)
1-Butanol                0        272.1    0.88     3.92     117.7
2M-1-propanol           -0.01     263.8    0.61     3.89     107.9
1-Pentanol               0.58     303.9    1.40     5.29     137.8
3M-1-butanol             0.51     291.4    1.14     5.34     131.2
2M-1-butanol             0.46     289.4    1.14     5.08     128.7
2-Pentanol               0.28     295.9    1.14     4.57     119.0
1-Hexanol                1.21     335.7    2.03     6.68     157.0
2-Hexanol                0.87     327.7    1.61     5.64     139.9
3-Hexanol                0.80     325.3    1.61     5.85     135.4
3M-3-pentanol            0.36     305.8    1.39     4.85     122.4
2M-2-pentanol            0.49     314.3    1.39     5.14     121.4
2M-3-pentanol            0.70     314.3    1.41     5.63     126.5
3M-2-pentanol            0.71     311.3    1.41     5.66     134.2
2,3MM-2-butanol          0.37     301.2    1.17     4.88     118.6
3,3MM-2-butanol          0.61     296.7    1.19     5.43     120.0
4M-2-pentanol            0.79     314.9    1.41     5.86     131.7
1-Heptanol               1.81     367.5    2.34     8.09     176.3
2M-2-hexanol             1.07     346.1    1.87     6.49     142.5
3M-3-hexanol             0.98     337.7    1.87     6.29     142.4
3E-3-pentanol            0.83     324.4    1.87     5.94     142.5
2,3MM-2-pentanol         0.87     323.8    1.67     6.02     139.7
2,3MM-3-pentanol         0.84     321.8    1.67     5.96     139.0
2,4MM-3-pentanol         1.22     331.7    1.71     6.82     138.8
2,2MM-3-pentanol         1.15     326.1    1.69     6.66     136.0
1-Octanol                2.35     399.4    2.84     9.56     195.2
2,2,3MMM-3-pentanol      1.27     335.2    1.99     6.95     152.2
1-Nonanol                3.00     431.2    3.15     11.0     213.1

In Figure 1 we illustrate the correlations for the properties listed in Table II.
In Figures 1a-1d we show the correlation of the four properties considered
here (-log S, CSA, log P and ln γ) with the boiling points of alcohols.
The correlations of the four properties among themselves (included in
Table III) show similar behavior, with a similar scatter of points, with a single
exception. The exception is the correlation between the two solubilities, -log S
and ln γ, shown in Figure 1e, which is extremely high. While for most other
property-property correlations of Table III the correlation coefficient is between
r = 0.950 and r = 0.990, the correlation of -log S and ln γ has r = 0.998. That
-log S and ln γ make an exceptional correlation is even better reflected in the
Fisher ratio, which is below 800 for all the other property-property correlations
(and below 500 for all but CSA/log P), whereas -log S and ln γ have an impressive
F close to 7000.

It is clear from Figure 1e that a single point appears to be an outlier, most
likely owing to an experimental error in either -log S or ln γ. When this point
(which belongs to 2-hexanol) is eliminated, the revised regression of -log S and
ln γ (shown in the lower part of Table III and indicated by an asterisk) shows a
dramatic improvement (r = 0.999 and F is over 16,750). This further supports
the suspicion that one of the experimental results for 2-hexanol was in error.

FIGURE 1  Correlations between different experimental properties of smaller alcohols.
Illustrations (a)-(d): Correlations with their experimental boiling points: negative
logarithm of solubility S; cavity surface area CSA; logarithm of octanol/water
partition P; natural logarithm of the infinite dilution activity coefficient γ,
respectively. Illustration (e): Correlation between the solubilities -log S and ln γ.

TABLE III  Comparison of correlation parameters for property-property correlations
of alcohols. The asterisk (*) indicates the regression in which the outlier was removed

Property-property       r         s         F
-log S/b.p.             0.9705    0.161     404
CSA/b.p.                0.9499    11.15     231
log P/b.p.              0.9620    0.153     310
ln γ/b.p.               0.9669    0.400     359
-log S/ln γ             0.9982    0.040     6873
CSA/ln γ                0.9721    8.372     429
log P/ln γ              0.9645    0.147     334
-log S/log P            0.9674    0.169     364
CSA/log P               0.9843    6.296     778
-log S/CSA              0.9752    0.1479    486
-log S*/ln γ*           0.9993    0.026     16,752

That the selected alcohol properties show limited correlation (except for the
already mentioned intercorrelation of the two solubilities) points to the fact
that different properties are dominated by different structural factors and will
require different molecular descriptors. Clearly the properties considered cannot
be reduced to the same structural features, which in itself explains why we
need different molecular indices and should continue to design novel topological
descriptors.

That 2-hexanol is an outlier is even better visible in Figure 2, in which we
show the same regression between -log S and ln γ but have limited the set
of alcohols to isomers of 1-hexanol. In this way we eliminate the dominant
role of molecular size (since we consider only alcohols having the same number
of carbon atoms). In Table IV we give the statistical data for the corresponding
regressions when considering the n = 10 hexanols. As we see from Table IV, the
statistical parameters have changed dramatically, not only because we have a
smaller sample but also because it is much harder to fit data for molecules of
the same size than to correlate data for molecules of different size. The standard
error, which now reflects the isomeric variations, has decreased, but the
correlation coefficient has also decreased, because it is more difficult to
correlate that part of a property that does not depend on size than the part of
the property that is size dependent. That 2-hexanol is an outlier is now reflected
in an unusual increase (by an order of magnitude) of the Fisher ratio between the
regressions including and excluding 2-hexanol.

FIGURE 2  The regression between the solubilities -log S and ln γ for the subset of
isomers of 1-hexanol.

TABLE IV  Comparison of correlation parameters for property-property correlations
for the subset of hexanols only. The asterisk (*) indicates the regression in which
the outlier was removed

Property-property       r         s         F
ln γ/(-log S)           0.9790    0.1163    184.9
*ln γ/(-log S)          0.9987    0.0313    2658.2
b.p./CSA                0.8932    5.6100    31.6
log P/b.p.              0.9484    0.0827    71.5
ln γ/b.p.               0.9003    0.2486    34.2
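
The outlier test described here can be retraced with a short sketch. It regresses ln γ on -log S for the ten hexanol isomers of Table II, with and without 2-hexanol, and assumes the usual single-descriptor definitions of r, s (with n - 2 degrees of freedom) and F; the function name is ours.

```python
from math import sqrt


def fit_stats(x, y):
    """Correlation coefficient r, standard error s and Fisher ratio F of y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_res = syy - sxy * sxy / sxx        # residual sum of squares
    r = sxy / sqrt(sxx * syy)
    s = sqrt(ss_res / (n - 2))
    f = (syy - ss_res) / (ss_res / (n - 2))
    return r, s, f


# (name, -log S, ln gamma) for the C6 alcohols, transcribed from Table II.
HEXANOLS = [
    ("1-hexanol", 1.21, 6.68),
    ("2-hexanol", 0.87, 5.64),
    ("3-hexanol", 0.80, 5.85),
    ("3M-3-pentanol", 0.36, 4.85),
    ("2M-2-pentanol", 0.49, 5.14),
    ("2M-3-pentanol", 0.70, 5.63),
    ("3M-2-pentanol", 0.71, 5.66),
    ("2,3MM-2-butanol", 0.37, 4.88),
    ("3,3MM-2-butanol", 0.61, 5.43),
    ("4M-2-pentanol", 0.79, 5.86),
]

all_points = fit_stats([p[1] for p in HEXANOLS], [p[2] for p in HEXANOLS])
kept = [p for p in HEXANOLS if p[0] != "2-hexanol"]
without_outlier = fit_stats([p[1] for p in kept], [p[2] for p in kept])

print("all ten hexanols:  r=%.4f s=%.4f F=%.1f" % all_points)
print("without 2-hexanol: r=%.4f s=%.4f F=%.1f" % without_outlier)
```

The two printed lines can be compared with the first two rows of Table IV; removing the single point raises the Fisher ratio by more than a factor of ten.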


WEIGHTED  PATHS  AS  DESCRIPTORS 

Even though correlations between different properties may vary considerably, a
single set of well selected molecular descriptors may nevertheless provide a
basis for their regression analysis. This has already been illustrated using a set
of the connectivity indices in correlating different physicochemical properties
of alkanes [54, 55]. However, all previous such studies were based on “fixed”
molecular descriptors (topological indices). It is of interest to see how variable
molecular topological indices, using an optimization procedure to determine the
best set of descriptors, would describe different molecular properties for the
very same sets of compounds.

In Table V we list the count of shorter paths in alcohols, discriminating the
C-O bond, to which we give the weight x. For p1 this simply increases the count of
CC bonds by x, but even this increment may be different for different properties.


RESULTS 

We should not be surprised that the weight x of the paths varies when we consider
different properties, even for the same set of compounds. We have already seen
that different molecular properties, particularly when attention is focused on
isomeric variations, do not correlate closely with one another.

We have previously reported a quite successful correlation for alcohol boiling
points when using variable path numbers. In the case of alcohols it was found
that the optimal weight for the CO bond, x = 2.2, reduced the standard error to
s = 4.82 when the path numbers p1 and p2 were used as descriptors, and to s = 4.78
when the path numbers p1, p2 and p3 were used as descriptors. The above results can
be compared with the standard error of 9 °C obtained by Nikolić, Trinajstić,
and Mihalić [56], who considered the Wiener number, the Schultz index, and
the valence connectivity index as descriptors. Admittedly these authors considered
regressions based on a single descriptor in order to evaluate the relative
merits of individual descriptors. Hence, the standard error of 9 °C is not directly
comparable to the standard error obtained when one uses two or more descriptors
(which can drop below 5 °C). However, if one is interested in obtaining the best
regression having statistical significance and giving as small a standard error as
possible, then clearly the procedure based on optimally weighted paths has, as
demonstrated, its advantages.

TABLE V  Count of smaller paths in alcohols with CO having weight x

Alcohol                 p1        p2         p3         p4         p5
1-Butanol               3 + x     2 + x      1 + x      x          0
2M-1-propanol           3 + x     3 + x      2x         0          0
1-Pentanol              4 + x     3 + x      2 + x      1 + x      x
3M-1-butanol            4 + x     4 + x      2 + x      2x         0
2M-1-butanol            4 + x     4 + x      2 + 2x     x          0
2-Pentanol              4 + x     3 + 2x     2 + x      1 + x      0
1-Hexanol               5 + x     4 + x      3 + x      2 + x      1 + x
2-Hexanol               5 + x     4 + 2x     3 + x      2 + x      1 + x
3-Hexanol               5 + x     4 + 2x     3 + 2x     2 + x      1
3M-3-pentanol           5 + x     5 + 3x     4 + 2x     1          0
2M-2-pentanol           5 + x     5 + 3x     3 + x      2 + x      0
2M-3-pentanol           5 + x     5 + 2x     3 + 3x     2          0
3M-2-pentanol           5 + x     5 + 2x     4 + 2x     1 + x      0
2,3MM-2-butanol         5 + x     6 + 3x     4 + 2x     0          0
3,3MM-2-butanol         5 + x     7 + 2x     3 + 3x     0          0
4M-2-pentanol           5 + x     5 + 2x     3 + x      2 + 2x     0
1-Heptanol              6 + x     5 + x      4 + x      3 + x      2 + x
2M-2-hexanol            6 + x     6 + 3x     4 + x      3 + x      2 + x
3M-3-hexanol            6 + x     6 + 3x     5 + 2x     3 + x      1
3E-3-pentanol           6 + x     6 + 3x     6 + 3x     3          0
2,3MM-2-pentanol        6 + x     7 + 3x     6 + 2x     2 + x      0
2,3MM-3-pentanol        6 + x     7 + 3x     6 + 3x     2          0
2,4MM-3-pentanol        6 + x     7 + 2x     4 + 4x     4          0
2,2MM-3-pentanol        6 + x     8 + 2x     4 + 4x     3          0
1-Octanol               7 + x     6 + x      5 + x      4 + x      3 + x
2,2,3MMM-3-pentanol     7 + x     10 + 3x    8 + 4x     3          0
1-Nonanol               8 + x     7 + x      6 + x      5 + x      4 + x

A number of interesting questions can be posed: (1) Does the optimal weight
depend on the compounds (alcohols) selected? In particular, does it depend on
the size of the molecules? (2) Does the optimal value of x depend on the number
of parameters used? (3) Does the optimal value of x depend on the property
considered? Here we will focus on the last two questions. In Table VI we
show the dependence of the statistical parameters r, s, and F on the weight
x for each property separately. As we see from Table VI, even though we
have essentially the same set of compounds, the optimal weights vary
dramatically from property to property. For each property we give
the correlation coefficient r, the standard error s, and the Fisher ratio F as
they vary with x, which has been confined to the appropriate domain. In
view of the relatively small number of molecules in each set (between 38 and
50) we limited the number of descriptors to at most three and have used p1, p2
and p3.

TABLE VI  Dependence of the statistical parameters on the CO bond weight. The
optimal value of the weight x is marked with an asterisk (*)

(a) Cavity surface area (CSA)

x        r         s         F
-1       0.9645    15.070    205
0        0.9952    5.583     1588
0.3      0.9977    3.865     3330
0.5*     0.9980    3.599     3842
0.7      0.9976    3.918     3241
1        0.9964    4.848     2111
1.5      0.9937    6.382     1212
2        0.9913    7.722     868
2.5      0.9893    8.337     704
3        0.9877    8.934     611
3.5      0.9863    9.497     549
4        0.9854    9.734     512
5        0.9838    10.239    461
6        0.9827    10.585    431
7        0.9818    10.835    410

(b) Water solubilities (-log S)

x        r         s         F
1        0.9883    0.1653    644
1.5      0.9925    0.1325    1011
2        0.9946    0.1127    1402
2.4      0.9954    0.1038    1655
2.5      0.9955    0.1023    1706
2.6      0.9956    0.1018    1721
3        0.9959    0.0975    1879
3.5      0.9961    0.0961    1932
4*       0.9961    0.0960    1941
5        0.9959    0.0981    1855
6        0.9907    0.1011    1749
7        0.9954    0.1039    1655

(c) Octanol-water partition (log P)

x        r         s         F
1        0.9845    0.1369    358
1.5      0.9873    0.1240    439
2        0.9885    0.1183    483
2.25     0.9887    0.1170    498
2.5      0.9889    0.1160    503.1
3*       0.9890    0.1156    506.3
3.25     0.9890    0.1157    505.8
3.5      0.9890    0.1158    504.6
4        0.9886    0.1175    491

(d) Infinite dilution activity coefficient (ln γ)

x        r         s         F
1        0.9974    0.4021    2493
2        0.9989    0.2674    5656
3        0.9992    0.2174    8564
4        0.9994    0.1995    10173
5        0.9994    0.1892    11307
6        0.9994    0.1854    11782
7        0.9995    0.1836    12007
8        0.9995    0.1829    12100
9*       0.9995    0.1827    12124
10       0.9995    0.1828    12112
12       0.9995    0.1834    12039
15       0.9995    0.1844    11903

For CSA the best value found for the weight (which is emphasized in
Table VI) is x = 0.5; the value x = 3 is optimal for the log P regression, the
value x = 4 is optimal for -log S, and finally the value x = 9 is optimal for
ln γ. These values of x may be compared to x = 2.2, found as the best value for
the boiling points of alcohols. Hence, clearly the weight x critically depends on
the property considered.
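
The search for an optimal weight can be sketched as a simple grid scan. The sketch below fits CSA to p1, p2 and p3 for the 27 alcohols common to Tables II and V and reports the weight with the smallest standard error. The least-squares solver, the grid and the function names are our assumptions, and because the subset differs from the n = 50 set used for Table VI, the optimum it finds need not coincide with the tabulated one.

```python
from math import sqrt

# (n_CC, a2, b2, a3, b3, CSA): p1 = n_CC + x, p2 = a2 + b2*x, p3 = a3 + b3*x.
# Rows follow the order of Table II; path polynomials from direct enumeration.
ALCOHOLS = [
    (3, 2, 1, 1, 1, 272.1), (3, 3, 1, 0, 2, 263.8), (4, 3, 1, 2, 1, 303.9),
    (4, 4, 1, 2, 1, 291.4), (4, 4, 1, 2, 2, 289.4), (4, 3, 2, 2, 1, 295.9),
    (5, 4, 1, 3, 1, 335.7), (5, 4, 2, 3, 1, 327.7), (5, 4, 2, 3, 2, 325.3),
    (5, 5, 3, 4, 2, 305.8), (5, 5, 3, 3, 1, 314.3), (5, 5, 2, 3, 3, 314.3),
    (5, 5, 2, 4, 2, 311.3), (5, 6, 3, 4, 2, 301.2), (5, 7, 2, 3, 3, 296.7),
    (5, 5, 2, 3, 1, 314.9), (6, 5, 1, 4, 1, 367.5), (6, 6, 3, 4, 1, 346.1),
    (6, 6, 3, 5, 2, 337.7), (6, 6, 3, 6, 3, 324.4), (6, 7, 3, 6, 2, 323.8),
    (6, 7, 3, 6, 3, 321.8), (6, 7, 2, 4, 4, 331.7), (6, 8, 2, 4, 4, 326.1),
    (7, 6, 1, 5, 1, 399.4), (7, 10, 3, 8, 4, 335.2), (8, 7, 1, 6, 1, 431.2),
]


def solve(a, b):
    """Solve the linear system a.c = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        c[i] = (m[i][n] - sum(m[i][k] * c[k] for k in range(i + 1, n))) / m[i][i]
    return c


def standard_error(rows, y):
    """Standard error of the least-squares fit of y to the columns of rows."""
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    coef = solve(ata, aty)
    ss_res = sum((v - sum(c * t for c, t in zip(coef, r))) ** 2
                 for r, v in zip(rows, y))
    return sqrt(ss_res / (len(y) - k))


def scan_weight(data, y, grid):
    """Return (smallest s, corresponding x) over the grid of weights."""
    results = []
    for x in grid:
        rows = [[1.0, ncc + x, a2 + b2 * x, a3 + b3 * x]
                for ncc, a2, b2, a3, b3, _ in data]
        results.append((standard_error(rows, y), x))
    return min(results)


csa = [row[5] for row in ALCOHOLS]
best_s, best_x = scan_weight(ALCOHOLS, csa, [i / 10 for i in range(0, 31)])
print("smallest s = %.3f at x = %.1f" % (best_s, best_x))
```

A finer grid, or a one-dimensional minimizer in place of the scan, would locate the minimum of s(x) more precisely.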

An increase of the weight x means that the role of the C-O bond relative to the
C-C bonds is gaining in importance. In Figure 3a we illustrate, for
the regression of -log S against the weighted paths p1, p2, p3, the variation
of the standard error s with the weight x, while in Figure 3b the similar
dependence of the standard error s on the weight x is shown for CSA. Both
figures show the position of the minimum, which corresponds to the optimal
weight x, and show a characteristic asymmetric shape of the dependence
s(x), similar in shape to the potential curves of diatomic molecules, or to parts
of such curves.

Table  VII  lists  the  optimal  paths  p2  and  p3  for  the  common  27  alcohols  (for 
which  data  on  all  four  properties  were  available)  when  optimal  values  of  x 
are  selected  for  each  property.  The  optimal  path  p,  are  not  listed  and  can  be 
easily  derived  using  expression  p,  =  nCC  +  x,  where  nCC  is  the  number  of  CC 
bonds  in  a  molecule.  The  occurrence  of  different  weights  for  different  prop¬ 
erties  introduces  changes  in  the  relative  role  of  shorter  and  longer  paths  for 
TABLE VII  The optimal weighted paths p2 and p3 for the four properties of alcohols

                      -log S, x = 4    CSA, x = 0.5    log P, x = 3     ln γ, x = 9
Alcohol                   p2     p3       p2     p3       p2     p3       p2     p3
1-Butanol                  6      5      2.5    1.5        5      4       11     10
2M-1-propanol              7      8      3.5      1        6      6       12     18
1-Pentanol                 7      6      3.5    2.5        6      5       12     11
3M-1-butanol               8      6      4.5    2.5        7      5       13     11
2M-1-butanol               8     10      4.5      3        7      8       13     20
2-Pentanol                11      6        4    2.5        9      5       21     11
1-Hexanol                  8      7      4.5    3.5        7      6       13     12
2-Hexanol                 12      7        5    3.5       10      6       22     12
3-Hexanol                 12     11        5      4       10      9       22     21
3M-3-pentanol             17     12      6.5      5       14     10       32     22
2M-2-pentanol             17      7      6.5    3.5       14      6       32     12
2M-3-pentanol             13     15        6    4.5       11     12       23     30
3M-2-pentanol             13     12        6      5       11     10       23     22
2,3MM-2-butanol           18     12      7.5      5       15     10       33     22
3,3MM-2-butanol           15     15        8    4.5       13     12       25     30
4M-2-pentanol             13      7        6    3.5       11      6       23     12
1-Heptanol                 9      8      5.5    4.5        8      7       14     13
2M-2-hexanol              18      8      7.5    4.5       15      7       33     13
3M-3-hexanol              18     13      7.5      6       15     11       33     23
3E-3-pentanol             18     18      7.5    7.5       15     15       33     33
2,3MM-2-pentanol          19     14      8.5      7       16     11       34     24
2,3MM-3-pentanol          19     18      8.5    7.5       16     15       34     33
2,4MM-3-pentanol          15     20        8      6       13     16       25     40
2,2-MM-3-pentanol         16     20        9      6       14     16       26     40
1-Octanol                 10      9      6.5    5.5        9      8       15     14
2,2,3MMM-3-pentanol       22     24     11.5     10       19     20       36     44
1-Nonanol                 11     10      7.5    6.5       10      9       16     15


different structures. Consider, for example, 2-methyl-1-butanol and 2-pentanol
(of Table II). When x = 0.5 (the optimal value for CSA) the quotients p2/p3
for 2-methyl-1-butanol and 2-pentanol are not very different: 4.5/3 and 4/2.5,
respectively. In contrast, when x = 4 (the optimal value for -log S) the
quotients p2/p3 for the two compounds are very different: 8/10 and 11/6,
respectively. The standard topological indices lack the flexibility to adjust
in a similar way to the demands dictated by the diversity of properties.
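
The weighted path counts quoted above can be reproduced with a short sketch. It assumes (an inference from p1 = nCC + x and from the Table VII entries, not something the text states explicitly) that every C-C bond has weight 1, the C-O bond has weight x, and a path contributes the product of its bond weights. The function name `weighted_paths`, the helper names, and the atom labels are illustrative only.

```python
def weighted_paths(bonds, max_len=3):
    """Return {length: summed path weight} for an acyclic molecular graph.

    bonds: list of (atom_a, atom_b, weight) for the hydrogen-suppressed skeleton.
    Assumption: the weight of a path is the product of its bond weights.
    """
    adj = {}
    for a, b, w in bonds:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))

    totals = [0.0] * (max_len + 1)

    def extend(node, visited, length, weight):
        # Record the path that ends at `node`, then try to lengthen it.
        if length > 0:
            totals[length] += weight
        if length == max_len:
            return
        for nxt, w in adj[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, length + 1, weight * w)

    for start in adj:
        extend(start, {start}, 0, 1.0)

    # Every path was walked once from each end, so halve the totals.
    return {k: totals[k] / 2 for k in range(1, max_len + 1)}


def two_methyl_1_butanol(x):
    # HO-C1-C2(-C5)-C3-C4
    return [("C1", "C2", 1), ("C2", "C3", 1), ("C3", "C4", 1),
            ("C2", "C5", 1), ("C1", "O", x)]


def two_pentanol(x):
    # C1-C2(-OH)-C3-C4-C5
    return [("C1", "C2", 1), ("C2", "C3", 1), ("C3", "C4", 1),
            ("C4", "C5", 1), ("C2", "O", x)]


for x in (0.5, 4):
    a = weighted_paths(two_methyl_1_butanol(x))
    b = weighted_paths(two_pentanol(x))
    print(x, a[2], a[3], b[2], b[3])
```

Under these assumptions the two printed rows give p2 and p3 of 4.5, 3 and 4, 2.5 for x = 0.5, and 8, 10 and 11, 6 for x = 4, which are the quotients discussed in the text.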

In Table VIII we list the calculated properties and the residuals of the
regression obtained for the n = 27 common alcohols. For all four properties
all the calculated values are within two standard deviations, except in the
case of CSA, where the highly branched 2,2,3-trimethyl-3-pentanol shows a large
residual. The calculated CSA, 324.27, is too small; the reported experimental
value is 335.2. When this point is discarded as an outlier the standard
error drops to 3.124. The regression equations are listed in Table IX.
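
As a check on how Tables VII and IX fit together, a regression can be evaluated directly from the weighted paths. The sketch below is illustrative: the dictionary layout and the function name `predict` are assumptions, while the coefficients are copied from Table IX and the paths of 1-butanol from Table VII (with p1 = nCC + x and nCC = 3).

```python
# Coefficients (p1, p2, p3, constant) copied from Table IX.
COEFFS = {
    "CSA":   (46.4797, -9.8066, -5.1272, 139.7160),
    "log P": (0.5904, -0.0668, -0.0162, -2.3415),
}


def predict(prop, p1, p2, p3):
    """Evaluate the three-descriptor linear regression for one property."""
    c1, c2, c3, c0 = COEFFS[prop]
    return c1 * p1 + c2 * p2 + c3 * p3 + c0


# 1-Butanol: nCC = 3, so p1 = 3 + x; p2 and p3 are read from Table VII.
csa = predict("CSA", 3 + 0.5, 2.5, 1.5)   # optimal x = 0.5 for CSA
logp = predict("log P", 3 + 3, 5, 4)      # optimal x = 3 for log P
print(round(csa, 2), round(logp, 3))      # 270.19 0.802
```

The two printed values coincide with the calculated CSA and log P listed for 1-butanol in Table VIII.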
TABLE VIII  Calculated properties of alcohols. We display only the results for
the alcohols of Table II, but the calculations were based on all alcohols for
which data were available

Alcohol                 -log S*  Residual       CSA  Residual
1-Butanol                 0.035    -0.035    270.19      1.91
2M-1-propanol            -0.083     0.073    262.94      0.86
1-Pentanol                0.620    -0.040    301.73      2.17
3M-1-butanol              0.538    -0.028    291.93     -0.53
2M-1-butanol              0.490    -0.030    289.36      0.04
2-Pentanol                0.293    -0.013    296.83     -0.93
1-Hexanol                 1.205     0.005    333.28      2.42
2-Hexanol                 0.878    -0.008    328.38     -0.68
3-Hexanol                 0.830    -0.030    325.81     -0.51
3M-3-pentanol             0.410    -0.050    305.98     -0.18
2M-2-pentanol             0.470     0.020    313.67      0.63
2M-3-pentanol             0.700     0        313.44      0.86
3M-2-pentanol             0.736    -0.026    310.88      0.42
2,3MM-2-butanol           0.329     0.042    296.18      5.03
3,3MM-2-butanol           0.537     0.073    303.86      3.64
4M-2-pentanol             0.797    -0.007    318.57     -3.67
1-Heptanol                1.790     0.020    364.83      2.67
2M-2-hexanol              1.055     0.015    345.21      0.89
3M-3-hexanol              0.995    -0.015    337.52      0.18
3E-3-pentanol             0.934    -0.104    329.83     -5.43
2,3MM-2-pentanol          0.901    -0.031    322.59      1.21
2,3MM-3-pentanol          0.853    -0.013    320.02      1.76
2,4MM-3-pentanol          1.155     0.065    332.62     -0.92
2,2-MM-3-pentanol         1.074     0.076    322.81      3.29
1-Octanol                 2.375    -0.025    396.37      3.03
2,2,3MMM-3-pentanol       1.214     0.056    324.27     10.93
1-Nonanol                 2.960     0.040    427.92      3.28


COMPARISON  WITH  MRA  FROM  OTHER  SOURCES 

Comparisons between different regression results are primarily of interest
because they can point to the dominant and most relevant molecular descriptors
for the properties studied. Once such descriptors are identified they can
assist in revising or refining molecular models for the compounds considered.
The standard error is likely to point to the most useful regression if the
accuracy of the prediction is the only criterion considered. However, the
standard error, important as it is, is not necessarily the only parameter of
interest in structure-property-activity studies. Equally important, or even
more important, may be the structural meaning of the descriptors used: they
can facilitate not only an improvement of the model but may also offer better
insight into the structure-property relationship, even though a
structure-property correlation does not imply a causal relationship.

A  strict  comparison  between  different  regression  results  is  only  possible  if  the 
two  studies  use  the  same  experimental  data  on  the  same  set  of  compounds  with 
TABLE VIII  (Continued)

Alcohol                   log P  Residual      ln γ  Residual
1-Butanol                 0.802     0.078     4.018    -0.098
2M-1-propanol             0.703    -0.093     3.829     0.061
1-Pentanol                1.310     0.091     5.372    -0.082
3M-1-butanol              1.243    -0.103     5.281     0.059
2M-1-butanol              1.194    -0.054     5.171    -0.091
2-Pentanol                1.109     0.031     4.556     0.014
1-Hexanol                 1.817     0.213     6.726    -0.046
2-Hexanol                 1.617    -0.007     5.910    -0.270
3-Hexanol                 1.568     0.042     5.799     0.051
3M-3-pentanol             1.285     0.106     4.880    -0.030
2M-2-pentanol             1.349     0.041     5.003     0.137
2M-3-pentanol             1.452    -0.042     5.598     0.032
3M-2-pentanol             1.485    -0.075     5.696    -0.036
2,3MM-2-butanol           1.218    -0.048     4.789     0.091
3,3MM-2-butanol           1.319    -0.129     5.416     0.014
4M-2-pentanol             1.550    -0.140     5.819     0.041
1-Heptanol                2.324     0.016     8.080     0.010
2M-2-hexanol              1.857     0.013     6.357     0.133
3M-3-hexanol              1.792     0.078     6.234     0.056
3E-3-pentanol             1.727     0.143     6.111    -0.171
2,3MM-2-pentanol          1.725    -0.055     6.131    -0.111
2,3MM-3-pentanol          1.660     0.010     6.020    -0.060
2,4MM-3-pentanol          1.844    -0.134     6.750     0.070
2,2-MM-3-pentanol         1.778    -0.088     6.660     0.000
1-Octanol                 2.832     0.008     9.434     0.126
2,2,3MMM-3-pentanol       1.969     0.021     7.161    -0.211
1-Nonanol                 3.339    -0.189    10.788     0.212


TABLE IX  The regression equations

Property          p1         p2         p3    Constant
CSA          46.4797    -9.8066    -5.1272    139.7160
-log S        0.6660    -0.0802    -0.0115     -4.0677
log P         0.5904    -0.0668    -0.0162     -2.3415
ln γ          1.4571    -0.0907    -0.0123    -10.8897

the same number of descriptors. This is rarely the case, because between two
studies novel data may become available and are likely to be included in the
more recent work. In addition, different authors may have their own preferences
for selecting and testing descriptors, and may use larger sets of compounds
that allow an increased number of descriptors. Our comparison here is of this
kind: Cao and Li [52], who reported MRA on water solubility, surface cavity
area, and log P, included alkanes and cycloalkanes in their set alongside
alcohols. Similarly, Mitchell and Jurs [53] included, besides alcohols, a
variety of organic compounds having other heteroatoms (halogens, nitrogen). As
we will see, for our results the standard error has decreased dramatically in
comparison with the above-mentioned results, except for the log P correlation,
where the improvement is significant but not dramatic. In view of the
differences in sample size and in the diversity of compounds, it should not be
surprising that we obtain a smaller standard error than others. What is
surprising is by how much the standard error is reduced when optimized
descriptors are used.

Here are listed r and s for the properties studied, as reported in refs.
[52, 53] and in this work:

Property        r
CSA        0.9954
-log S     0.994
log P      0.992
ln γ
ln γ
ln γ



Property        r         s     Source
CSA        0.9980     3.599     this work
CSA*       0.9985     3.124     this work
-log S     0.9961     0.0960    this work
-log S*    0.9978     0.0713    this work
log P      0.9890     0.1156    this work
ln γ       0.9995     0.1827    this work


Here  n  is  the  size  of  the  sample  (structures)  and  N  is  the  number  of  parameters 
(descriptors)  used  in  the  regressions,  while  F  is  Fisher  ratio. 


CONCLUDING  REMARKS 

We have outlined a novel way of deriving powerful structure-property models.
We consider assigning a variable weight x to shorter paths in molecules, to
be determined during the regression analysis so that one obtains the smallest
standard error for the correlation considered. Even though the approach has
been demonstrated on several physicochemical properties of simple chemical
structures, it is general and applies to the analysis of properties of more
complex chemical compounds. The advantage of the outlined approach is that it
yields regressions with a considerably smaller standard error than similar
studies using standard molecular descriptors. The “flexibility” of the
molecular descriptor, such as the weighted paths used in this study, in
contrast to fixed molecular descriptors, which are numerically determined once
the molecular structure is known, makes it possible to describe different
properties of the same set of compounds by the same kind of descriptors. As can
be seen from the analysis of several properties of smaller alcohols, different
properties may require different weighting factors. This suggests that methods
in which a prescribed modification of topological indices is assumed in order
to describe heteroatoms, such as the valence connectivity indices of Kier and
Hall, have inherent limitations: they may be suitable for some molecular
properties but less suitable for others. Indeed, several authors have reported
correlations for compounds involving heteroatoms for which a simple
connectivity index gives a better regression than the corresponding valence
connectivity index.
In this article we (1) outline the construction of a 3-D “graphical” representation of DNA primary sequences,
illustrated on a portion of the human β-globin gene; (2) describe a particular scheme that transforms the
above 3-D spatial representation of DNA into a numerical matrix representation; (3) illustrate the construction
of matrix invariants for DNA sequences; and (4) suggest a data reduction based on statistical analysis of
matrix invariants generated for DNA. Each of the four contributions represents a novel development that
we hope will facilitate comparative studies of DNA and open new directions for the representation and
characterization of DNA primary sequences.


INTRODUCTION 

With the rapid reporting of DNA sequences derived with
automated DNA sequencing techniques, the problem of
processing such information has become acute. The usual
representation of the primary sequence of DNA is a string of the letters
A, G, C, T, which signify the four nucleic acid bases adenine,
guanine, cytosine, and thymine, respectively. Such sequences
can be very long, and even the segments of interest when
comparing DNA of different species can be quite lengthy.
In Table 1 we list the DNA of the human β-globin gene. Its length
is 1424, and its first exon alone involves 92 bases.
Comparison of such primary sequences, and even of their
fragments having fewer than 100 bases, can be quite difficult
for several reasons. Consider the first exon of the
β-globin gene for eight different species shown in Table 2.
They all look similar, but at the same time they are all
sufficiently different. How similar or how different they are
may depend on how such strings of letters are encoded or
characterized. The standard procedures consider differences
between strings due to deletion-insertion, compression-expansion,
and substitution of the string elements.1-9 These
approaches have been applied to a variety of problems, from
the error-correcting codes for which Levenshtein introduced
metrics for string comparisons1 to the comparison of DNA
sequences, the comparison of protein sequences, and applications
in quantitative structure-activity relationship (QSAR).8,9
Such approaches, which have hitherto been widely used, are
computer intensive.
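
The string-comparison procedures mentioned above can be illustrated by the Levenshtein (edit) distance. The following is a minimal textbook dynamic-programming sketch, not the procedure of any of the cited references; the function name and the short test strings (taken from the start of the human and rabbit exons of Table 2) are illustrative.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


print(levenshtein("ATGGTGCACCTG", "ATGGTGCATCTG"))  # 1 (a single substitution)
```

The double loop visits every pair of positions, so the cost grows with the product of the two sequence lengths, which is one reason such comparisons become computer intensive for long sequences.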

We have recently proposed an alternative approach for the
comparison of sequences that is based on the characterization
of DNA by ordered sets of invariants derived for the DNA
sequence, rather than on a direct comparison of the DNA
sequences themselves. This is analogous to the use of graph
invariants (topological indices) for the characterization of
molecules rather than the use of information on their geometry and
the types of atoms involved. An important advantage of a
characterization of structures (be they molecules or DNA) by
invariants, as opposed to the use of codes, is the simplicity of
the comparison of numerical sequences based on invariants.
The price paid is a loss of information on some aspects of
the structure, which accompanies any characterization based on
invariants. The loss of information, however, can be reduced
in part by the use of a larger number of descriptors (invariants),
as has been well illustrated in SAR and QSAR based on
mathematical descriptors for molecules.10-12

Graphical representations of DNA that have been developed
within the past few years13-15 offer a route to one such
condensation of the information coded by a DNA primary
sequence into a set of invariants. In Figure 1 we show a few
graphical representations of selected DNA as reported by
Nandy.16 The graphs are obtained by assigning the four
nucleic acid bases A, G, C, T to the four directions associated
with the positive and negative x and y axes, such that A and
T correspond to the negative x and y axes, respectively, and G
and C correspond to the positive x and y axes, respectively.
An advantage of graphical representations of DNA is that they
allow visual comparisons, which are easier to make. One
should, however, be aware of the loss of information inherent
in such graphical representations. One limitation is
that the graphical form shows the “path” of the “travel” along
the primary sequence but not the “history” of the travel.
Hence, we do not know when, or which, parts of the graphical
path were retraced. At the top of Figure 2 we show a
graphical representation of the first exon of the human β-globin
gene at a higher magnification. The rest of Figure 2
shows the first exon of the β-globin gene of several other species
for comparison. As we can see upon inspection, qualitative

©  2000  American  Chemical  Society 


10.1021/ci000034q CCC: $19.00

Published  on  Web  08/15/2000 


1236  J.  Chem.  Inf.  Comput.  Sci.,  Vol.  40,  No.  5,  2000 


Randic  et  al. 


Table 1. DNA of Length 1424 Listing Nucleic Bases in Human
Beta Globin Gene^a

ATGGTGCACCrcACTCCTGAGGAGAAGTCrGCCGTTACrGCCCrGTGGG 

GCAAGGTGAACCTGGATGAAGTTGGTGGTGAGGCCCTGGCCAGGTTGG 

TATCAACGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCA 

TGTGGAGACAGAGAAGACTCrTGGGTTTCTGATAGGCACrGACTCTCTC 

TGCCTATTGGTCTATTTTCCCACCCTTAGCCrGCTGGTGGTCTACCCITGG 

ACCCAGAGGITCTTTGAGTCCTTTGGGGATCrGTCCACTCCTGATGCTGT 

TATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGC 

CTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCXA 

CACTGAGTGAGCrGCACTGTGACAAGCrGCACGTGGATCCTGAGAACTT 

CAGGGTGAGTCTATGGGACCCTTGATGTTTTCTrTCCCCTTCTTTTCTATG 

G1TAAGTTCATGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAG 

AATGGGAAACACACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCG 

TTTTAGTTTCi  i  l  iATTTGCTGTTCATA  AC  A  ATTG 1 1 1 JCI 1 1 IGTTTAAT 

CCTT  AAC  ATT  GTGTAT  A  AC  AA  AAGG  A  A  ATATCT  CT  G  AG  ATAC  ATTAAG 
TAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTG 
G  AATAT  ATGTGTGCTT  ATTTGCAT  ATTCATA  ATCTCCCT  ACTTT  ATTTTC 
TA'rCTTATTTCTAATACTTTCCCTAATCTCTTTOTTCAGGGCAATAATG 
ATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTCAT 
AATTTCTGGGTTAAGGCAATAGCAATATTTCTGCATATAAATATTTCTG 
CATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGC 
TACAATCCACCTACCATTCTGCrnTATnTATGGTTCGGATAAGGCTG 
GATTATTCTGACTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTC 
TTATCTTCCTCCCACAGCTCCIGGGCAACGTGCTGGTCrGTGTGCTGGCCC 
ATCACnTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAA 

AGTGGTGGCrGGTGTGGCTAATGCCCTGGCCCACAAGTATCACrAA 

TAT CTT ATTTCT A  ATACTTTCCCTA  ATCTC I  TTCT  I  T C AGGGC A  ATA  AT G 

ATACAATGrATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTCAT 

AATTTCTGGGTTAAGGCAATAGCAATATTTCTGCATATAAATATTTCTG 

CATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATACCAGC 

TACAATCCAGCTACCATTCTGCTTTTATTTTATGGTTGGGATAAGGCTG 

GATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTC 

TTATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCrGTGTGCTGGCCC 

ATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAA 

AGTGGTG  G  CTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA 

^a ID HSHBB: beta globin gene sequence extract; exons: 1-92,
223-445, 1296-1424; introns: 93-222, 446-1295. SQ Hshbb.MKl,
segment from 62205 to 63628 of HSHBB.

similarities  and  differences  between  exons  of  different  species 
are  immediately  apparent. 
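
The loss of “history” noted above can be made concrete with a small sketch of the planar walk (A and T along the negative x and y axes, G and C along the positive x and y axes). The helper names are illustrative, and the two short sequences are made up for the demonstration. A sequence that retraces a segment produces the same drawn picture as one that does not.

```python
STEP_2D = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def walk_2d(seq):
    """Successive points visited by the planar walk, starting from the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in seq:
        dx, dy = STEP_2D[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def drawn_segments(seq):
    """Undirected unit segments that appear in the picture (retracings collapse)."""
    pts = walk_2d(seq)
    return {frozenset(pair) for pair in zip(pts, pts[1:])}


print(drawn_segments("GC") == drawn_segments("GAGC"))  # True: identical pictures
print(walk_2d("GC") == walk_2d("GAGC"))                # False: different histories
```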

Mathematical curves can be represented in the form f(x,
y) = 0, which corresponds to the graphical projections of DNA
of Figure 2, and in a parametric form x = x(t) and y = y(t).
Clearly there is a loss of information in going from a
parametric representation of a curve, x = x(t) and y = y(t), to
the analytical representation of the same curve. The form f(x, y)
= 0 represents only the path, while the former, if the
parameter t is interpreted as time, gives the history of the
movement over the path. Equally, there is a loss of information
when a spatial curve is represented by its projection in the
(x, y) plane (or any other plane). Hence, two routes for an


Table  2.  First  Exon  of  Beta  Globin  Gene  for  Eight  Species  Labeled 
A-H 

A human beta-globin 92 bases

ATGCTGCACCTGACTCCTGACCAGAAGTCTGCCGTTACTGCCCTGrGGG 

GCAAGGTGAACGTGGATTAAGTTGGTGGIGAGGCXaGGGCAG 

B goat alanine beta-globin 86 bases

ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGG 
TGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG 
C opossum beta-hemoglobin beta M-gene 92 bases

ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACrACCATCTGGT 
CTAAGGTGCAGGTTCACCAGACTCGTGGTGAGGCCCTTCCCAC 
D  gallus  gallus  beta  globin  92  bases 

ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGG 
GCAAGGTCAATGTGGCCGAATGTCGGGCCGAAGCCCTGGCCAC 
E lemur beta-globin 92 bases

ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGG 
CGAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG 

F  mouse  beta-a-globin  93  bases 

ATGGITGCACCrCACrCATCCTGAGAAGrcrGCrGTCTCITGCCTCTCGC 
CAAAGGTGAACCCCGATGAAGTrGGTGGTCAGCCCCrGCCCAGC 
G  rabbit beta-globin  90  bases 

ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGG 
GCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC 
H  rat  beta-globin  92  bases 

ATGCTCCACCrAACIGATCCTGAGAAGGCTACl'GITAGTGGCCTGTGGG 

CAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG 


Figure 1. A few graphical representations of selected DNA that
Nandy and collaborators developed. Panels: 1, horse; 2, rhesus monkey;
3, orangutan; 4, goat.

improvement of graphical representations of DNA sequences
appear possible: (1) to consider a representation analogous
to the parametric representation of mathematical curves and (2)
to consider a graphical representation of the DNA sequence with a
“path” that is traced in 3-D space, rather than in a plane.
In this paper we will limit our attention to this latter problem.
We will then describe a scheme which generates a numerical matrix
for a graphical spatial representation of DNA.
Once we arrive at a matrix representation of DNA we will
search for suitable matrix invariants to be used for the
characterization of DNA. Finally we will consider a possible
condensation of the derived numerical characterization of DNA
into a more compact format.

Figure 2. Graphical representations of the first exon of the human beta globin gene (top), a “detail” of Figure 1, and the remaining beta
globin genes of Table 2.

3-D  REPRESENTATION  OF  DNA  PRIMARY  SEQUENCE 

The two-dimensional representation of DNA developed by
Nandy4 assigns the four nucleic bases to the four directions
defined by the positive and negative x and y coordinate axes,
so that A and G are associated with the x-axis and C
and T with the y-axis. This assignment of directions differs
from the assignment considered by Leong and Morgenthaler,14
who take a move to the right to correspond to A, a
move to the left to C, an upward move to T, and a
downward move to G.

The nonequivalent directions are created after the assignment
of the first base, because then there remains only one site
that is opposite to the already selected direction; the other
two sites are at lateral positions. If we could have three
equivalent directions after the first assignment we would
avoid considering the multiplicity of alternatives (projections).
This is possible by using the directions defined by the
vertices of a regular tetrahedron. Seen from its
center, all four directions toward the four vertices are
equivalent; hence, after one direction is selected the remaining
three directions stay equivalent. We will therefore assign to the four
nucleic acid bases the four directions associated with the
regular tetrahedron. To specify the directions we place the
origin of the Cartesian (x, y, z) coordinate system at the center
of a cube (Figure 3) so that the four corners of the cube,
Figure 3. The tetrahedral directions assigned to A, G, C, T nucleic
bases.


Table 3. Cartesian 3-D Coordinates for Initial Part of the Sequence
of DNA Nucleic Bases of the First Exon

No.  Base    x    y    z      No.  Base    x    y    z
  1   A     +1   -1   -1       15   T     -1   +1   +1
  2   T     +2    0    0       16   C     -2    0   +2
  3   G     +1   +1   -1       17   C     -3   -1   +3
  4   G      0   +2   -2       18   T     -2    0   +4
  5   T     +1   +3   -1       19   G     -3   +1   +3
  6   G      0   +4   -2       20   A     -2    0   +2
  7   C     -1   +3   -1       21   G     -3   +1   +1
  8   A      0   +2   -2       22   G     -4   +2    0
  9   C     -1   +1   -1       23   A     -3   +1   -1
 10   C     -2    0    0       24   G     -4   +2   -2
 11   T     -1   +1   +1       25   A     -3   +1   -3
 12   G     -2   +2    0       26   A     -2    0   -4
 13   A     -1   +1   -1       27   G     -3   +1   -5
 14   C     -2    0    0       28   T     -2   +2   -4


which define the tetrahedral directions, have the coordinates
(+1, -1, -1), (-1, +1, -1), (-1, -1, +1), and (+1, +1,
+1). To each tetrahedral direction we assign one nucleic base
as follows:

(+1, -1, -1) → A
(-1, +1, -1) → G
(-1, -1, +1) → C
(+1, +1, +1) → T

The particular assignment is arbitrary, but this has no
significance since all directions are equivalent. To obtain the
spatial path associated with the DNA sequence, we move in
(x, y, z) space in the direction that the above assignment
dictates. Consider the beginning of the first exon of Table 1:

ATGGTGCA....

The first point of the spatial curve is at (+1, -1, -1),
the point reached from the origin in the direction that belongs
to A. From that point we move in the direction assigned to T,
(+1, +1, +1), which means that all three coordinates of the
position of A, (+1, -1, -1), have to be increased by +1. We
then arrive at the point (+2, 0, 0) as the location of T. From here
we move in the direction (-1, +1, -1) assigned to G, which
means that the first and the third coordinates decrease while
the second coordinate increases. This leads to the point
(+1, +1, -1) as the location of G. Continuing
in the direction of G we again have to decrease x and z (the
first and the third coordinates) and to increase y (the second
coordinate). Thus we come to the point (0, +2, -2). The
process continues: each time we algebraically add the (x, y,
z) coordinates of the direction of the new base to those of the
last point. Continuation of this process is illustrated in Table 3
for the two dozen initial nucleic bases of the first exon. In
Figure 4 we show a portion of the 3-D graphical representation
of the DNA of Table 1.

Figure 4. Portion of 3-D graphical representation of DNA of
Table 1.
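
The construction just described amounts to a running sum of direction vectors. The sketch below is a minimal illustration of that rule; the names `DIRECTION` and `walk_3d` are illustrative, while the base-to-direction assignment is the one given in the text.

```python
# Tetrahedral direction assigned to each base, as listed in the text.
DIRECTION = {
    "A": (1, -1, -1),
    "G": (-1, 1, -1),
    "C": (-1, -1, 1),
    "T": (1, 1, 1),
}


def walk_3d(seq):
    """Cumulative (x, y, z) position after each base; the walk starts at the origin."""
    x = y = z = 0
    path = []
    for base in seq:
        dx, dy, dz = DIRECTION[base]
        x, y, z = x + dx, y + dy, z + dz
        path.append((x, y, z))
    return path


for i, (base, point) in enumerate(zip("ATGGTGCA", walk_3d("ATGGTGCA")), start=1):
    print(i, base, point)
```

For the eight bases ATGGTGCA the printed points are (1, -1, -1), (2, 0, 0), (1, 1, -1), (0, 2, -2), (1, 3, -1), (0, 4, -2), (-1, 3, -1), and (0, 2, -2), in agreement with the worked steps in the text.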


NUMERICAL  CHARACTERIZATION  OF  SPATIAL 
REPRESENTATION  OF  DNA 

An important advantage of graphical representations of
DNA, both 2-D and 3-D, is the possibility of deriving numerical
characterizations for such mathematical objects. One way
to arrive at a numerical characterization of DNA is to associate
a matrix with its graphical representation, given by a curve in
space (or in a plane). Once we have a matrix we can use
matrix invariants to arrive at various numerical descriptors,
rather than relying on the visual description of the DNA sequence.
This is analogous to the use of matrices associated with molecular
graphs or molecular structures as a source for the construction
of topological indices, rather than the use of molecular models
(such as “sticks-and-balls” or “space-filling” models) for their
representation.10

Formally, there is no difference between a graphical
“sequence chain” (in 2-D or 3-D space) and an actual polymer
(“atom chain”) in space. Hence, we can transfer
mathematical methods used for the characterization of
molecules in structure-property and structure-activity
studies to the numerical characterization of 3-D representations
of the primary DNA sequence. This has been considered
recently by Randic and collaborators17 for the 2-D graphical
representation of DNA.

We should mention that one can also arrive at numerical
descriptors that are specific and sensitive to the graphical
form of a DNA without necessarily resorting to matrices.
For example, Raychaudhury and Nandy18 considered
several geometrical parameters of DNA curves, such as the
end-to-end distance, as DNA descriptors. Matrices,
however, offer additional descriptors and a richer
characterization, can be manipulated by a computer, and allow
one to take advantage of linear algebra rather than being
confined to ordinary geometry.

The search for novel descriptors may be an endless project,
as has been the case with the mathematical descriptors
that continue to be constructed for molecules. The art,
however, is in finding useful descriptors, and those that have
a plausible structural interpretation, at least within the model
considered. Matrices have an additional advantage: they
allow one to construct additional matrices by combining
elements of different matrices as components. In this way
one can arrive at additional descriptors for DNA. In this
report we will confine our interest particularly to the graph
Table 4. Leading Eigenvalues for D/D Matrices of Eight-Atom
Chains Embedded on a Graphite Lattice and the Leading
Eigenvalues of the Corresponding Line Adjacency Matrices

conformer    λ1/n of D/D matrix    λ1/n of line adjacency matrix
TTTTT              0.7903                    0.8571
TTTTC              0.7695                    0.7191
TTTCT              0.7613                    0.5916
TTCTT              0.7541                    0.5208
CTTTC              0.7506                    0.5858
TTCTC              0.7451                    0.4688
TCTCT              0.7448                    0.4019
TCTTC              0.7365                    0.4748
CTCTC              0.7336                    0.3836
TTTCC              0.7250                    0.5793
TCTCC              0.7112                    0.3773
TTCCT              0.7021                    0.4464
CTTCC              0.6997                    0.4533
TCCTC              0.6966                    0.3538
CCTCC              0.6786                    0.3375
TTCCC              0.6654                    0.4426
CTCCC              0.6587                    0.3347
TCCCT              0.6472                    0.3347


possible. It belongs to the hypothetical all-cis configuration
CCCCC, the projection of which on the hexagonal lattice gives
a regular hexagon. In this structure the first and the last CC
bonds of C8 would overlap, giving λ1 = 4.6388, which
when normalized becomes λ1/8 = 0.5798. The relative
magnitudes of λ1/n and the shapes of the corresponding
conformations fully support the interpretation of the normalized
eigenvalue of D/D matrices as an index of the folding of a
structure.


Figure  5.  Conformations  of  eight-atom  chain  embedded  on  a 
graphite  lattice  ordered  according  to  decreasing  values  of  the  leading 
eigenvalue  of  D/D  matrix. 

theoretical  distance  matrix  and  the  Euclidean  distance  matrix 
for  characterization  of  graphical  forms  of  DNA. 

MATRICES  INVOLVING  DISTANCES 

The input information in a graph distance matrix19,20 is
confined solely to information on the connectivity of the
structure (system). However, when a graph is embedded in
a space it assumes a fixed geometry. Then, in addition to
the graph theoretical distance between a pair of vertices, we
can also compute the Euclidean distance between the same
pair of vertices. The Euclidean and the graph theoretical
distances can be combined into a single distance/distance
matrix by taking the quotient of the corresponding matrix
elements.21,22 The collection of such quotients for all pairs of
vertices leads to the so-called D/D matrix. Matrices
constructed in this way have proved very promising as a tool for
the characterization of structures embedded in 3-D space. The
normalized leading eigenvalue λ1/n of a D/D matrix offers
a measure of the degree of folding of a chainlike structure
or a curve. In Figure 5 we illustrate configurations of
eight-atom C8 chains embedded on a graphite lattice. Under
each skeleton is given the normalized λ1/n of the D/D matrix.
As we see, the largest eigenvalue (λ1/8 = 0.7903) is associated
with the least bent, all-trans configuration of C8, and the
smallest eigenvalue (λ1/8 = 0.6472) belongs to the highly
folded isomer TCCCT. The labels T and C stand for trans and cis
conformations of three consecutive CC bonds (consult Table
4 for the structures belonging to the different labels). For chains of
seven CC bonds an even smaller eigenvalue than 0.6472 is


A  single  descriptor,  even  though  it  may  be  instructive, 
offers  but  a  limited  characterization  for  a  large  system.  Often 
additional  descriptors  are  needed.  They  can  be  constructed 
by  considering  the  so-called  “higher  order”  D/D  matrices.23 
These  matrices  are  obtained  by  taking  the  powers  of  the 
quotients  of  two  distances,  rather  than  just  using  the  quotients 
of the distances themselves. As a result we can derive for a
geometrical (graphical-spatial) representation of DNA an
algebraic characterization based on a set of invariants, obtained
by calculating the leading eigenvalues of the set of “higher
order” matrices nD/nD. We will continue to use the simplified
notation D/D even though the D in the numerator stands for
the Euclidean distances and the D in the denominator stands
for graph theoretical distances.
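
The construction described in this section can be sketched in a few lines. The following is only an illustrative sketch, not the authors' program: the function names (`dd_matrix`, `folding_index`, `zigzag`) and the square-lattice `hairpin` chain are assumptions introduced here, and the graph distance is taken as |i - j|, which holds for an unbranched chain with unit bond lengths.

```python
import numpy as np

def dd_matrix(coords, power=1):
    """Elementwise (Euclidean distance / graph distance)**power for a chain."""
    pts = np.asarray(coords, dtype=float)
    n = len(pts)
    # Euclidean distances between all pairs of vertices
    eucl = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # in an unbranched chain the graph distance between i and j is |i - j|
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])
    dd = np.divide(eucl, sep, out=np.zeros((n, n)), where=sep > 0)
    return dd ** power

def folding_index(coords):
    """Normalized leading eigenvalue (lambda_1 / n) of the D/D matrix."""
    dd = dd_matrix(coords)
    return np.linalg.eigvalsh(dd)[-1] / len(dd)

def zigzag(n):
    """All-trans chain with unit bonds and 120 degree bond angles."""
    return [(i * np.sqrt(3) / 2, 0.5 * (i % 2)) for i in range(n)]

# hypothetical test chains (unit bond length in both cases)
straight = [(float(i), 0.0) for i in range(8)]
hairpin = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (1, 1), (0, 1)]

for name, chain in (("straight", straight), ("zigzag", zigzag(8)), ("hairpin", hairpin)):
    print(name, round(folding_index(chain), 4))
```

The more a chain folds back on itself, the smaller the Euclidean distances become relative to the bond counts, and the smaller the normalized leading eigenvalue.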

D/D  MATRICES  FOR  DNA 

The Euclidean distances between bases in a 3-D graphical
model of DNA are obtained from the 3-D coordinates of
the nucleic bases listed in Table 3 using
[(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2]^{1/2},
where x_i, y_i, z_i and x_j, y_j, z_j are the
Cartesian coordinates of the points considered. To obtain the
D/D matrix we first have to normalize the distance scale so
that the Euclidean distance between adjacent vertices equals
1, not √3 (as a result of taking the side of the cube to be equal
to 1). Then we have to divide each Euclidean distance by
the number of bonds separating the two vertices to obtain
the desired quotient of the two distances. In Table 5 we
illustrate a part of the D/D matrix (corresponding to the nine
initial bases of the DNA primary sequence of exon 1 of the human
β-globin gene). The numerator combined with the factor 1/√3 gives
the Euclidean distance between vertices i, j when the
separation between adjacent bases is assigned distance 1, and

Table 5. Portion of the D/D Matrix for the First Exon of DNA of Table 1

     1   2   3        4         5         6         7         8         9
1    0   1   2/2√3    √11/3√3   4/4√3     √27/5√3   √8/6√3    √11/7√3   √8/8√3
2        0   1        √12/2√3   √11/3√3   √24/4√3   √19/5√3   √12/6√3   √11/7√3
3            0        1         2/2√3     √11/3√3   √8/4√3    √3/5√3    2/6√3
4                     0         1         2/2√3     √3/3√3    0         √3/5√3
5                               0         1         2/2√3     √3/3√3    √8/4√3
6                                         0         1         2/2√3     √11/3√3
7                                                   0         1         2/2√3
8                                                             0         1
9                                                                       0

Table 6. Numerical Values for the Initial Portion of the D/D Matrix and
“Higher Order” D/D Matrices (a)

Row 1
power   j=2   j=3        j=4        j=5        j=6        j=7        j=8        j=9
1       1     0.57735    0.63828    0.57735    0.60000    0.27217    0.27355    0.20412
2       1     0.33333    0.40741    0.33333    0.36000    0.07407    0.07483    0.04167
4       1     0.11111    0.16598    0.11111    0.12960    0.00549    0.00560    0.00174
8       1     0.012345   0.02755    0.012345   0.01680    3.011e-5   3.135e-5   3.014e-6
16      1     1.524e-4   7.590e-4   1.524e-4   2.821e-4   9.064e-10  9.831e-10  9.085e-12

Row 2
power   j=3   j=4   j=5        j=6        j=7        j=8        j=9
1       1     1     0.63828    0.70711    0.50332    0.33333    0.27355
2       1     1     0.40741    0.50000    0.25333    0.11111    0.07483
4       1     1     0.16598    0.25000    0.06418    0.012345   0.00560
8       1     1     0.02755    0.06250    0.00412    1.524e-4   3.135e-5
16      1     1     7.590e-4   0.00391    1.696e-5   2.323e-8   9.831e-10

Row 3
power   j=4   j=5        j=6        j=7        j=8        j=9
1       1     0.57735    0.63828    0.40825    0.20000    0.19245
2       1     0.33333    0.40741    0.16667    0.04000    0.03704
4       1     0.11111    0.16598    0.02778    0.00160    0.00137
8       1     0.012345   0.02755    7.716e-4   2.560e-6   1.882e-6
16      1     1.524e-4   7.590e-4   5.954e-7   6.554e-12  3.541e-12

Row 4
power   j=5   j=6        j=7        j=8   j=9
1       1     0.57735    0.33333    0     0.20000
2       1     0.33333    0.11111    0     0.04000
4       1     0.11111    0.012345   0     0.00160
8       1     0.012345   1.524e-4   0     2.560e-6
16      1     1.524e-4   2.323e-8   0     6.554e-12

Row 5
power   j=6   j=7        j=8        j=9
1       1     0.57735    0.33333    0.40825
2       1     0.33333    0.11111    0.16667
4       1     0.11111    0.012345   0.02778
8       1     0.012345   1.524e-4   7.716e-4
16      1     1.524e-4   2.323e-8   5.954e-7

Row 6
power   j=7   j=8        j=9
1       1     0.57735    0.63828
2       1     0.33333    0.40741
4       1     0.11111    0.16598
8       1     0.012345   0.02755
16      1     1.524e-4   7.590e-4

Row 7
power   j=8   j=9
1       1     0.57735
2       1     0.33333
4       1     0.11111
8       1     0.012345
16      1     1.524e-4

Row 8
power   j=9
1       1

(a) Only the upper triangle is shown; the diagonal elements are 0. For each
element the first row (power 1) is the numerical value of the D/D element,
while the successive rows correspond to 2D/2D, 4D/4D, 8D/8D, and 16D/16D,
respectively. An entry such as 3.011e-5 stands for 3.011 × 10^-5.

the  denominator  is  the  graph  theoretical  distance  between 
the  same  two  vertices. 
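
As an illustration of the two steps just described (normalize by √3, then divide by the number of bonds), the sketch below builds the 3-D curve from a base string and forms the D/D matrix. The base-to-direction assignment is the tetrahedral one quoted in the text; the function names and the nine-base string `SEQ` are assumptions made for this example and should be checked against Table 1 before any comparison with Tables 5 and 6.

```python
import numpy as np

# Tetrahedral directions assigned to the four bases, as stated in the text
DIRECTIONS = {"A": (1, -1, -1), "G": (-1, 1, -1), "C": (-1, -1, 1), "T": (1, 1, 1)}

def dna_curve(seq):
    """Vertices of the 3-D curve: running sum of the base direction vectors."""
    steps = np.array([DIRECTIONS[b] for b in seq], dtype=float)
    return np.cumsum(steps, axis=0)

def dna_dd_matrix(seq):
    """D/D matrix: normalized Euclidean distance over the number of bonds."""
    pts = dna_curve(seq) / np.sqrt(3.0)      # adjacent vertices are now 1 apart
    n = len(pts)
    eucl = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])  # graph distance along the chain
    return np.divide(eucl, sep, out=np.zeros((n, n)), where=sep > 0)

SEQ = "ATGGTGCAC"   # illustrative nine-base string (an assumption, see lead-in)
dd = dna_dd_matrix(SEQ)
print(np.round(dd[0], 5))   # first row of the D/D matrix
```

A zero off the diagonal, as between vertices 4 and 8 here, signals that the 3-D curve passes through the same point twice.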

The “higher order” D/D matrices are constructed by raising
the elements of the D/D matrix (Table 5) to an ever
increasing power. In Table 6 we show the corresponding
entries of the higher order D/D matrices, which are grouped
into a single matrix where each row gives the numerical
values corresponding to matrix elements of D/D, 2D/2D,
4D/4D, 8D/8D, and 16D/16D. As we can see, all matrix elements
that are smaller than one decrease as the exponents of the
power increase. If one continues to raise exponents to even
higher powers, all the elements of the nD/nD matrix that are
different from one would soon become very small and could
be neglected. Hence, in the limit as n → ∞ they are zero,
and the resulting D/D matrix reduces to a binary matrix. In
Table 7 we show the initial part of the limiting binary matrix
∞D/∞D for the first exon of DNA of Table 1, again displaying
only a 9 × 9 section. As we can see, all the elements above


Table 7. Initial Portion of the Limiting (Symmetrical) Matrix of
the nD/nD Matrix Truncated at n = 16 (a)

      1  2  3  4  5  6  7  8  9 10 11 12
 1    0  1
 2    1  0  1  1
 3       1  0  1
 4       1  1  0  1
 5             1  0  1
 6                1  0  1
 7                   1  0  1
 8                      1  0  1
 9                         1  0  1  1
10                            1  0  1
11                            1  1  0  1
12                                  1  0

(a) Only zeros at the diagonal positions are shown.

the main diagonal of the limiting matrix corresponding to
adjacent sites in the DNA chain are necessarily equal to 1.

Figure 6. Conformations of line-adjacency graphs of eight-atom
chains  embedded  on  a  graphite  lattice  ordered  according  to 
decreasing  values  of  leading  eigenvalue  of  line  adjacency  matrix. 
The  order  of  isomers  in  Figure  5  and  this  figure  is  different. 


However,  entry  1  appears  in  addition  at  all  sites  associated 
with  a  repetition  of  the  same  nucleic  base  in  the  primary 
DNA  sequence.  For  the  first  exon  of  Table  1  this  happens 
at  sites  3,  4  and  9,  10,  and  so  on.  When  constructing  the 
3-D  graphical  model  at  these  sites  we  continue  to  move  in 
the  same  direction,  and  the  corresponding  segment  of  the 
3-D  graphical  model  forms  a  line  segment.  Hence,  the 
elements  of  the  limiting  matrix  indicate  the  so-called  “line 
adjacency”.  The  limiting  matrix,  referred  to  as  the  “line 
adjacency  matrix”,24  is  known  in  Graph  Theory  as  the 
adjacency  matrix  of  the  Menger  graph  of  a  configuration.25 
For the graphs of Figure 5 we show the corresponding Menger
graphs in Figure 6. Their “line adjacency” matrices represent the limiting
∞D/∞D matrices. They are also embedded in a plane because
they have been derived from already embedded graphs.
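
The passage to the limiting binary matrix can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the nine-base string is chosen for the example (it is not copied from Table 1), the helper names are hypothetical, and the limit is obtained simply by keeping the D/D elements that equal 1 within a small tolerance.

```python
import numpy as np

DIRECTIONS = {"A": (1, -1, -1), "G": (-1, 1, -1), "C": (-1, -1, 1), "T": (1, 1, 1)}

def dna_dd_matrix(seq):
    """D/D matrix of the 3-D curve of a base string (unit step length)."""
    steps = np.array([DIRECTIONS[b] for b in seq], dtype=float)
    pts = np.cumsum(steps, axis=0) / np.sqrt(3.0)
    n = len(pts)
    eucl = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])
    return np.divide(eucl, sep, out=np.zeros((n, n)), where=sep > 0)

def limiting_matrix(dd, tol=1e-9):
    """Binary 'line adjacency' matrix: 1 where the D/D element equals 1."""
    return (np.abs(dd - 1.0) < tol).astype(int)

dd = dna_dd_matrix("ATGGTGCAC")      # illustrative string, an assumption
binary = limiting_matrix(dd)

# Elements below 1 fade away as the exponent grows; elements equal to 1 stay.
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(float(np.linalg.eigvalsh(dd ** n)[-1]), 5))
limit = float(np.linalg.eigvalsh(binary.astype(float))[-1])
print("limit", limit)
```

For this string the only entry of 1 beyond the adjacent sites lies between vertices 2 and 4, which reflects the repeated base at sites 3 and 4.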

A comparison of Figures 5 and 6 shows that the line adjacency
matrix carries different information than the D/D matrices
from which it was algebraically constructed. The graphs in
Figure 5 are ordered according to descending magnitudes
of the normalized leading eigenvalue of the D/D matrix,
and the graphs in Figure 6 are ordered according to the
leading eigenvalue of the limiting matrix. The resulting order
is different from the order induced by the leading eigenvalue
of the D/D matrix. The leading eigenvalue of the limiting matrix
can be viewed as an index of flexibility (or stiffness) of a
structure, at least in some special cases.24 Apparently
structures with longer “line” segments have larger λ1 or
λ1/n. When this is “translated” to the graphical representation
relating to DNA sequences, the occurrence of “straight”
segments corresponds to repeated recurrence of the same base in a
sequence. Hence, DNA sequences with a larger


Figure 7. Projection of a portion of the 3-D graphical representation
of DNA of Figure 4 on the (x, y), (x, z), and (y, z) coordinate planes.

number of repeating bases and longer such repeating segments
will have a larger leading eigenvalue of the limiting
binary matrix ∞D/∞D.

PROJECTIONS  OF  3-D  SPATIAL  SEQUENCE 
REPRESENTATION 

Spatial curves can be projected on the coordinate planes (x,
y), (x, z), or (y, z), or on any plane, for that matter. Projecting
a 3-D spatial curve on each of the three
coordinate planes is quite simple when the coordinates of all
the points are known. All that is needed is to ignore the
coordinate perpendicular to the plane of the projection.
Hence, for the first nucleic base of Table 1, A, with spatial
coordinates (+1, -1, -1) we have for the projection on the
x, y plane x = 1 and y = -1. For the projection of the same
base on the x, z plane we have x = 1 and z = -1, while for
the projection of the first nucleic base on the y, z plane we
obtain y = -1 and z = -1. Hence, the projection coordinates
can be read directly from Table 2 by ignoring one column,
depending on the projection considered. In Figure 7 we show
the three projections for the first 12 bases of the exon of DNA
of Table 1. It is interesting to observe that the projection of the
spatial 3-D representation of DNA on the (x, y) coordinate
plane is identical with the 2-D graphical representation of
Nandy26,27 already depicted at the top of Figure 2. Hence,
our 3-D visual representation of DNA automatically contains
the 2-D graphical representation of Nandy as one of its
projections. This, however, is not surprising, because if we
project the four vertices of the tetrahedron having the
coordinates (+1, -1, -1), (-1, +1, -1), (-1, -1, +1),
(+1, +1, +1) on the (x, y) plane we obtain the points (+1, -1),
(-1, +1), (-1, -1), (+1, +1). The first set of points is
associated with the directions for A, G, C, T in 3-D as outlined
in this paper, and the second set of points is associated with
directions for A, G, C, T in 2-D that coincide with those of
Nandy if we rotate the coordinate system by -135°.
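
Dropping one coordinate is all the projection requires, as the following short sketch shows for the four tetrahedral directions quoted above. The names `PLANES` and `project` are introduced here purely for illustration.

```python
import numpy as np

# Tetrahedral directions for the four bases, as given in the text
TETRAHEDRON = {"A": (1, -1, -1), "G": (-1, 1, -1), "C": (-1, -1, 1), "T": (1, 1, 1)}

# Each coordinate plane keeps two of the three coordinate columns
PLANES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}

def project(points, plane):
    """Project 3-D points on a coordinate plane by ignoring one coordinate."""
    pts = np.asarray(points)
    return pts[:, list(PLANES[plane])]

vertices = [TETRAHEDRON[b] for b in "AGCT"]
for plane in PLANES:
    print(plane, project(vertices, plane).tolist())
```

In each of the three planes the four bases land on four distinct corners of a square, so every projection is itself a valid 2-D walk.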

Similarly we find that the projection of the spatial 3-D
representation of DNA on the (x, z) coordinate plane is
identical with the 2-D graphical representation of Leong and
Morgenthaler.14 Hence, our 3-D visual representation of DNA
contains alternative 2-D graphical representations as its
projections. We may add that there is yet a third projection
of the 3-D graphical representation of DNA, the projection on
the plane (y, z), that corresponds to the assignment of the
four directions defined by the positive and the negative x
and y coordinate axes to the four nucleic bases so that A
and T are associated with the x axis and C and G with the
y axis. As we see from Figure 7 this projection differs from
those of Nandy and of Leong and Morgenthaler and may have its
own merits. Finally, we should add that one can consider
projections of 3-D graphical curves of DNA on planes other
than coordinate planes. While projections offer the convenience
of a 2-D representation, all these projections are associated
with some loss of information inherent in the projection
process.

Although the three projection paths of the 3-D representation
of DNA are different, their limiting matrices are
identical. This can be understood, because the form of the
limiting matrix depends only on the repetition of the same
nucleic base in the primary sequence of DNA and that is
independent of the graphical representation of DNA and the
projection process.

MATRIX  INVARIANTS  OF  DNA 

The search for a matrix representation of the DNA primary
sequence was motivated by the desire to have numerical descriptors
for DNA that are sequence invariants. Numerical
characterization of DNA primary sequences will make
comparisons of different DNA sequences much simpler than
comparisons based on alphabet symbols or the corresponding
codes. Moreover, it will lead to a quantitative measure of
similarity and may open a novel method of characterization
for the same set of sequences. Matrices not only offer various
inherent invariants as a tool for such comparisons but also
allow one to consider modifications of matrix elements and
in this way may further enrich the tools for comparative study
of DNA. In this report we will continue to confine our
attention to the D/D matrix of DNA, but it will be clear that the
outlined schemes are equally valid not only for the “higher
order” D/D matrices but also for other matrices that one can
associate with DNA.

Among numerous matrix (and graph) invariants we will
consider first the average matrix element, which in the case
of the graph theoretical distance matrix, except for normalization,
is related to the Wiener number, a well-known graph
theoretical invariant.28,29 Alternatively one can consider the
average row sum, which differs from the average matrix
element and the Wiener number again only by a normalization
factor. The average row sum has an advantage, particularly
when the individual row sums do not differ widely, because
it may suggest an approximate value for the leading
eigenvalue of the matrix. According to the Frobenius-Perron
theorem of linear algebra the largest and the smallest row
sums represent the upper and the lower bounds, respectively,
for the leading eigenvalue (λ1) of a symmetric matrix with nonnegative elements.30 In

Table 8. The Upper Bounds, the Lower Bounds, the Leading
Eigenvalue, and the Average Row Sums for Truncated Matrices of
DNA

n   row sum max   λ1         row sum min   row sum average
1   0             0          0             0
2   1             1          1             1
3   2             1.732051   1.57735       1.718233
4   3             2.629245   2.21563       2.607815
5   3.63828       3.238402   2.79298       3.203444
6   4.34539       3.869193   3.39298       3.843783
7   4.84871       4.242930   3.09442       4.178791
8   5.18204       4.455833   2.71756       4.335833
9   5.45559       4.737987   3.49400       4.508241

Table 9. Average Matrix Element as a Function of Gradually
Truncated D/D Matrix

n    x, y, z   x, y      x, z      y, z
1    0         0         0         0
2    0.86603   0.70711   0.70711   0.70711
3    1.21424   1.07298   0.62854   1.07298
4    1.74711   1.52917   1.06066   1.52917
5    2.00204   1.82479   0.90510   1.82479
6    2.34274   2.16431   1.02138   2.16431
7    2.35303   2.25833   1.23982   2.07952
8    2.23832   2.11133   1.21440   1.97442
9    2.25630   2.13265   1.24376   1.89102
10   2.46497   2.24965   1.54132   1.94565
11   2.51032   2.19350   1.71264   2.03576
12   2.55077   2.23357   1.80319   2.00313
13   2.47111   2.15924   1.75222   1.92259
14   2.51231   2.20976   1.79277   1.89779
15   2.50930   2.12249   1.84061   1.92616
16   2.63879   2.14294   2.01366   2.04107

Table  8  we  have  listed  the  upper  bounds,  the  lower  bounds, 
and  the  leading  eigenvalue  for  truncated  sequence  of  DNA 
for  n  =  1  to  n  =  9.  Observe  how  closely  the  average  row 
sum  (given  in  the  last  column)  approximates  the  leading 
eigenvalue,  particularly  for  shorter  segments  of  the  matrix. 
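
The row-sum bounds can be verified directly on the 3 × 3 truncation, whose only element different from 0 and 1 is 2/2√3 = 0.57735 (Table 5). The helper name `row_sum_summary` below is an assumption introduced for this sketch.

```python
import numpy as np

def row_sum_summary(m):
    """Return (max row sum, leading eigenvalue, min row sum, mean row sum)."""
    m = np.asarray(m, dtype=float)
    row_sums = m.sum(axis=1)
    lam = np.linalg.eigvalsh(m)[-1]      # largest eigenvalue of a symmetric matrix
    return row_sums.max(), lam, row_sums.min(), row_sums.mean()

a = 1.0 / np.sqrt(3.0)                   # the element 2/(2*sqrt(3))
m3 = [[0, 1, a],
      [1, 0, 1],
      [a, 1, 0]]

upper, lam, lower, mean = row_sum_summary(m3)
print(upper, lam, lower, mean)
```

For a symmetric matrix the average row sum is the Rayleigh quotient of the all-ones vector, which is why it never exceeds the leading eigenvalue and approaches it when the row sums are nearly equal.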

The leading eigenvalue of a matrix is an important matrix
invariant. We have already mentioned that λ1/n of the D/D
matrix is an index of the folding of a structure, and λ1/n of
the limiting matrix can be viewed as an index of the
flexibility of a system. Similarly, the λ1 of the adjacency
matrix and λ1 of the path matrix represent alternative indices
of (molecular) branching,31,32 while λ1 of the D/DD matrix,
where DD represents the detour matrix,33,34 is an index of
the cyclicity of a system.35,36 The average row sum may give
a similar insight into a system as the leading eigenvalue.
The average row sum, however, can be easily computed,
while computation of eigenvalues of large matrices is more
involved, and, of course, the DNA sequences could be very
long. For example, the 1424 bases of Table 1, of which we
considered the first exon only (92 bases), are a part of 73,326
base pairs.37

The average row sum, and also the average matrix element
of a D/D matrix, will depend on the size of the matrix as is
seen from Table 9 where under the heading x, y, z we have
listed the average matrix element as a function of n, the size
of the matrix at truncation of the DNA sequence. The same was
true for the leading eigenvalue of the truncated DNA
sequences (Table 8).

The  dramatic  condensation  of  data  illustrated  above  may 
be  excessive  for  some  more  ambitious  comparisons  of  DNA 
sequences.  In  such  cases,  one  can,  in  addition  to  D/D  matrix, 
also consider the leading eigenvalue or the average element

Table 10. Leading Eigenvalue of the D/D Matrix and Higher Order
D/D Matrices for n = 2 to n = 20 Showing the Convergence for λ1
and the Limit for n → ∞

power   λ1        power   λ1
1       4.73797   12      2.35418
2       3.54855   13      2.35143
3       2.99558   14      2.34966
4       2.71223   15      2.34851
5       2.55903   16      2.34777
6       2.47313   17      2.34729
7       2.42349   18      2.34696
8       2.39409   19      2.34675
9       2.37629   20      2.34661
10      2.36537   limit   2.34631654447882
11      2.35850

of 2D/2D matrix, of 3D/3D matrix, and so on. A dozen nD/nD
matrices can in this way offer a sufficient number of
invariants for more extensive comparisons of DNA sequences.
In Table 10 we report the leading eigenvalue for the
9 × 9 nD/nD matrices for n = 1 to n = 20, which illustrates
the “profile,” the sequence of descriptors, for the particular
fragment of DNA. As n increases the value of the leading
eigenvalue λ1 converges to a limiting value. The limit can
be easily computed as it represents the leading eigenvalue
of the binary matrix of the same size (here 9 × 9). Using the so
constructed “profiles” the calculation of the similarities of
DNA sequences is transformed into a calculation of similarities
of the corresponding numerical sequences of DNA
descriptors, a task which is not computer intensive
compared to similar studies using alignment methodologies.
Of course, it remains to be investigated which set
of invariants may offer optimal characterization for DNA
comparisons and how sensitive such “profiles” are to minor
changes in DNA composition. In a recent study in which
the DNA sequence was characterized by average distances
between various nucleic acid bases it was shown that the
“distance profiles”, constructed analogously to the
“leading eigenvalue profile” reported here, are very sensitive
already when a single nucleic base has been changed (i.e.,
the case of mutation).41
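
A “profile” comparison of this kind might be sketched as follows. Everything specific in the example is an assumption: the two nine-base strings are illustrative (the second differs from the first in its last base only), the helper names are hypothetical, and the Euclidean norm is just one possible way of comparing two profiles.

```python
import numpy as np

DIRECTIONS = {"A": (1, -1, -1), "G": (-1, 1, -1), "C": (-1, -1, 1), "T": (1, 1, 1)}

def dna_dd_matrix(seq):
    """D/D matrix of the 3-D curve of a base string (unit step length)."""
    steps = np.array([DIRECTIONS[b] for b in seq], dtype=float)
    pts = np.cumsum(steps, axis=0) / np.sqrt(3.0)
    n = len(pts)
    eucl = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])
    return np.divide(eucl, sep, out=np.zeros((n, n)), where=sep > 0)

def leading_eigenvalue_profile(seq, powers=range(1, 21)):
    """Leading eigenvalues of the nD/nD matrices for the requested powers n."""
    dd = dna_dd_matrix(seq)
    return np.array([np.linalg.eigvalsh(dd ** n)[-1] for n in powers])

def profile_distance(p, q):
    """Euclidean distance between two profiles (one possible dissimilarity)."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

reference = leading_eigenvalue_profile("ATGGTGCAC")   # illustrative string
mutated = leading_eigenvalue_profile("ATGGTGCAA")     # last base changed
print(np.round(reference, 5))
print(profile_distance(reference, mutated))
```

Because each profile is a short numerical vector, comparing many sequences reduces to comparing vectors of equal length, with no alignment step.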

CONCLUDING  REMARKS 

In this article we (1) outlined a construction of a 3-D
“graphical” representation of DNA primary sequences,
illustrated on a portion of the human β-globin gene; (2)
described a particular scheme that allows the 3-D spatial
representation of DNA to be transformed into a numerical
matrix representation; (3) illustrated the derivation of a set of
matrix invariants from the matrix representation of DNA;
and (4) suggested a relatively simple data reduction based on
statistical analysis of the generated DNA matrix invariants. Each
of the four contributions, in our view, not only will facilitate
comparative studies of DNA but will also open possibilities for
further developments in the condensation of primary DNA
sequence information. The outlined 3-D representation, for
example, can be modified by use of the sequential labels as
the fourth coordinate in order to prevent the 3-D spatial curve
from overlapping itself. The numerical matrix characterization offers
many alternatives, from the use of different distance measures
to the use of different matrix forms. In addition to the
possibility of selecting matrix invariants, which is almost
unlimited, we have the possibility of selecting different
matrices to start the process of condensation of data. Hence,
we anticipate here an expansion, if not explosion, of
alternatives that may parallel the expansion of the topological
indices proposed for the characterization of molecular
structure-property-activity relationships and the introduction of
novel matrices for chemical graphs. The most significant
aspect considered in this contribution may turn out to be the
data reduction step, in which a large number of input data are
condensed into a substantially smaller set of derived parameters.
This important aspect of DNA data analysis has only
recently received some attention,38-40 but, in view of the
exponential growth of the automated DNA sequencing
techniques, the problem of digesting novel information, no
doubt, will require novel ideas that go beyond just listings
of nucleic bases of a primary sequence. The construction of
sequence “profiles”, illustrated in this report, may be one
way of data reduction, in addition to the recently proposed
grouping of data for different nucleic acids separately, which
allows large (n × n) matrices (where n can run into the
hundreds or the thousands) to be condensed to small (4 ×
4) matrices where the rows and the columns are associated
with the four nucleic bases A, G, C, and T. Needless to say,
the outlined approach is suitable for characterization of
local fragments of DNA, which is precisely how one may
look on the truncated DNA fragment considered in this work.

Appendix 1.10 QSPR modeling: Graph connectivity indices
versus line graph connectivity indices

Five QSPR models of alkanes were reinvestigated. Properties considered were molecular surface-dependent
properties (boiling points and gas chromatographic retention indices) and molecular volume-dependent
properties (molar volumes and molar refractions). The vertex- and edge-connectivity indices were used as
structural parameters. In each studied case we computed connectivity indices of alkane trees and alkane
line graphs and searched for the optimum exponent. Models based on indices with an optimum exponent
and on the standard value of the exponent were compared. Thus, for each property we generated six QSPR
models (four for alkane trees and two for the corresponding line graphs). In all studied cases QSPR models
based on connectivity indices with optimum exponents have better statistical characteristics than the models
based on connectivity indices with the standard value of the exponent. The comparison between models
based on vertex- and edge-connectivity indices gave in two cases (molar volumes and molar refractions)
better models based on edge-connectivity indices and in three cases (boiling points for octanes and nonanes
and gas chromatographic retention indices) better models based on vertex-connectivity indices. Thus, it
appears that the edge-connectivity index is more appropriate for modeling structure-molecular volume
properties and the vertex-connectivity index for modeling structure-molecular surface properties.

The use of line graphs did not improve the predictive power of the connectivity indices. Only in one case
(boiling points of nonanes) was a better model obtained with the use of line graphs.


INTRODUCTION 

This study was motivated by two recent papers. In one
Estrada and Rodriguez1 have shown that the edge-connectivity
index produced the best single-variable QSPR models
for five out of seven physicochemical properties of octanes.
In another Gutman et al.2 have reported that the use of line
graphs, in some cases, significantly improves the predictive
power of topological indices. We decided to test both of these
results by using them to reinvestigate several QSPR models
from the literature. We also decided to test further the result
that in many cases the optimum exponent of the vertex- and
edge-connectivity indices is not -0.5.3 Since we believe,
along with many others,4 that QSPR modeling will
become the tool of choice for many chemists-at-large in times
to come, it seems to us worthwhile to search for the most
reliable framework to carry out this kind of modeling. The
present study is an attempt in this direction. It should also
be noted that throughout this paper we will use the chemical
graph theoretical concepts and language5 only to simplify
the analysis.

Recently, line graphs have been increasingly used in
structure-property modeling,2,6-11 although they may be
traced back to van’t Hoff, who used the line graphs of the
structural formulas for representing simple organic compounds.
Line graphs are described in a monograph on
chemical graph theory12 and under the name bond graphs
were used in deriving the molecular complexity indices.13
The line graph L(G) = L of graph G is a graph derived from
G in such a way that the edges in G are replaced by vertexes
in L. Two vertexes in L are adjacent if the corresponding
two edges in G are incident, that is, have a vertex in common.
The construction of a line graph from a tree is shown in
Figure 1.

The  line  graph  L  is  usually  a  more  complex  structure  than 
the  corresponding  graph  G.  Only  in  the  case  of  unbranched 
cycloalkanes,  represented  by  cycles,  L  and  G  coincide 
because  in  cycles  the  number  of  vertexes  V  and  the  number 
of  edges  E  are  identical.  For  n-alkanes,  represented  by  the 
hydrogen-depleted  chains,  L  is  less  complex  than  G  because 
it  has  one  less  vertex  than  G,  since  in  chains  E  =  V  -  1. 

The numbers of vertexes V and edges E of the line graph
L and the corresponding graph G are related by

    V(L) = E(G)    (1)

    E(L) = \frac{1}{2} \sum_i d_i^2(G) - E(G)    (2)

where d_i (i = 1, 2, ..., V) are the degrees of vertexes in G. These
relations can be easily confirmed by inspecting G and L
depicted in Figure 1.

Using the equation

    \sum_i d_i^2(G) = M_1    (3)

where M_1 is called14-16 the first Zagreb-group index,17,18 and
introducing (3) into (2), we obtain the expression

    E(L) = \frac{1}{2} M_1 - E(G)    (4)

Gutman and Estrada derived the same expression,19 but the
factor 1/2 is missing in their expression. From (4) follows
the amusing result that the M_1 index of the graph is simply
equal to twice the total number of vertexes and edges in the line
graph:

    M_1 = 2[E(L) + E(G)] = 2[E(L) + V(L)]    (5)

Figure 1. Construction of line graph L from tree T depicting 2,4-dimethylhexane.
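
Relations 1, 2, and 5 can be confirmed on the tree of 2,4-dimethylhexane with a few lines of code. The sketch below is illustrative only: the vertex numbering of the carbon skeleton and the function names are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def line_graph(edges):
    """Edges of L(G): pairs of edges of G that share a vertex."""
    return [(i, j) for i, j in combinations(range(len(edges)), 2)
            if set(edges[i]) & set(edges[j])]

def vertex_degrees(edges):
    """Degree of every vertex of the graph given by its edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def first_zagreb(edges):
    """M_1: the sum of squared vertex degrees."""
    return sum(d * d for d in vertex_degrees(edges).values())

# Hydrogen-depleted tree of 2,4-dimethylhexane (numbering is an assumption):
# main chain 1-2-3-4-5-6, methyl carbons 7 (on C2) and 8 (on C4)
tree = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (2, 7), (4, 8)]

e_g = len(tree)                 # E(G)
v_l = len(tree)                 # V(L) = E(G), eq 1
e_l = len(line_graph(tree))     # E(L), counted directly
m1 = first_zagreb(tree)
print(v_l, e_l, m1)
```

Counting E(L) directly and comparing it with M_1/2 - E(G) checks eq 4 without assuming it.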

SIMPLE  MODIFICATION  OF  THE  VALENCE  VERTEX- 
AND  EDGE-CONNECTIVITY  INDICES 

Vertex-Connectivity Index. The standard definition of
the vertex-connectivity index is20

    \chi = \sum_{edges} [d(v_i) d(v_j)]^{-0.5}    (6)

where d(v_i) is the degree of the vertex v_i and [d(v_i) d(v_j)]^{-0.5}
may be considered as the weight of the i-j edge.21 The
summation in (6) goes over all edges. The vertex degree d(v_i)
is equal to the number of vertexes adjacent to vertex i in a
graph G. Any two vertexes in G are adjacent if there is an
edge connecting them.

Equation 6 is open to modification because the choice of
edge weights [d(v_i) d(v_j)]^{-0.5} was based on one possible
solution to the inequalities based on ordering graphs.20 Other
choices of weights are also possible. Hence, the
quantity [d(v_i) d(v_j)]^{-0.5} can be replaced by [d(v_i) d(v_j)]^k,
where k is a variable exponent that can be varied in any
desired range of values, and (6) becomes3

    \chi = \sum_{edges} [d(v_i) d(v_j)]^k,  k \neq 0    (7)

Edge-Connectivity Index. The standard definition of the
edge-connectivity index is similar to the definition of the
vertex-connectivity index, the only change being the use of
the edge degrees d(e_i) instead of vertex degrees d(v_i):22

    \epsilon = \sum_{adjacent edges} [d(e_i) d(e_j)]^{-0.5}    (8)

The edge degree d(e_i) is equal to the number of edges
adjacent to edge i in a graph G. Any two edges in G are
adjacent if they meet at the same vertex. Because every edge
in G connects two vertexes, the edge degree d(e) can be
expressed in terms of their degrees as follows:22

    d(e) = d(v_i) + d(v_j) - 2    (9)
This expression can be used to assign the degrees of edges
in G. In Figure 2 we give the vertex and edge degrees in
tree T and the corresponding line graph L depicted in Figure 1.

Figure 2. Vertex degrees (digits at each vertex) and edge degrees
(digits at each edge) in tree T and the corresponding line graph L
from Figure 1.

A simple way to assign the degrees to edges in graph G
or its line graph L is to count all bonds adjacent to the bond
for which we wish to determine the edge degree. This
procedure is illustrated in Figure 3.

Equation 8 can also be modified because the quantity $[d(e_i)\, d(e_j)]^{-0.5}$ was the result of mimicking the original definition of Randic for the vertex-connectivity index.20 Consequently, $[d(e_i)\, d(e_j)]^{-0.5}$ can be replaced by $[d(e_i)\, d(e_j)]^{k}$, where k is a variable exponent that can be varied in any desired range of values. Thus, (8) converts into the following equation:

$$\epsilon = \sum_{\text{adjacent edges}} [d(e_i)\, d(e_j)]^{k} \qquad k \neq 0 \tag{10}$$

At this point it should also be noted that the edge-adjacency matrix23 of the graph G, EA(G), is identical to the vertex-adjacency matrix23 of the line graph L of G, VA(L):

$$EA(G) = VA(L) \tag{11}$$

This must be so because the edge degrees in G are identical to the vertex degrees in the corresponding line graph L (see Figure 2). The consequence of (11) is that the edge-connectivity index of G is identical to the vertex-connectivity index of the corresponding line graph L:19

$$\epsilon(G) = \chi(L) \tag{12}$$
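The relations between edge degrees, the line graph, and the two indices can be checked numerically. The sketch below is an illustration under our own assumptions: graphs are simple and given as edge lists, the helper names (`vertex_degrees`, `line_graph`, `connectivity_index`, `edge_connectivity_index`) are hypothetical, and the 2-methylbutane skeleton is just a small test case.

```python
from collections import defaultdict
from itertools import combinations

def vertex_degrees(edges):
    """Map each vertex to the number of edges incident to it."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

def line_graph(edges):
    """Edges of the line graph L of a simple graph G.

    Vertices of L are the indices of the edges of G; two of them are
    joined when the corresponding edges of G share a vertex.
    """
    return [(a, b) for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])]

def connectivity_index(edges, k=-0.5):
    """Vertex-connectivity index with exponent k."""
    degree = vertex_degrees(edges)
    return sum((degree[u] * degree[v]) ** k for u, v in edges)

def edge_connectivity_index(edges, k=-0.5):
    """Edge-connectivity index; edge degrees are d(v_i) + d(v_j) - 2."""
    degree = vertex_degrees(edges)
    edge_degree = [degree[u] + degree[v] - 2 for u, v in edges]
    return sum((edge_degree[a] * edge_degree[b]) ** k
               for a, b in line_graph(edges))

# 2-methylbutane skeleton: C1-C2(-C5)-C3-C4
skeleton = [(1, 2), (2, 3), (3, 4), (2, 5)]
L = line_graph(skeleton)
line_degrees = [vertex_degrees(L)[i] for i in range(len(skeleton))]

eps_G = edge_connectivity_index(skeleton)   # index of G from edge degrees
chi_L = connectivity_index(L)               # vertex index of the line graph
```

For this skeleton the edge degrees obtained from the vertex degrees coincide with the vertex degrees of the line graph, and `eps_G` and `chi_L` agree to rounding error, as the identity between the two indices requires.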

RESULTS  AND  DISCUSSION 

Figure 3. A simple procedure for assigning the degrees to edges in tree T and the related line graph L. Panel (1): tree T and its line graph L.

We studied five structure-property models that were already reported in the literature. This was done on purpose because our aim was to compare the performance of the obtained models with those already published. The properties considered were boiling points of octanes and nonanes and molar volumes, molar refractions, and retention indices of alkanes. Boiling points and retention indices are typical "surface"-dependent properties, while molar volumes and molar refractions are "molecular volume"-dependent properties. In all cases molecules were depicted as graphs and corresponding line graphs. The standard deviation S was used as a criterion for the comparison of the models. The optimum parameter k in (7) and (10) was determined using the procedure described in our earlier report;3 that is, the parameter k was taken to be optimum when the value of S reached a minimum.
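The search for the optimum exponent can be pictured as a one-dimensional scan: for each trial k the index is recomputed, a straight line is fitted, and the k with the smallest S is kept. The sketch below is only an illustration of that idea. It assumes S is the residual standard deviation with n - 2 degrees of freedom, uses a simple grid of trial exponents, and checks itself on synthetic property values generated from a known exponent; none of the numbers are experimental data, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def chi(edges, k):
    """Vertex-connectivity index with exponent k."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum((deg[u] * deg[v]) ** k for u, v in edges)

def standard_deviation_of_fit(x, y):
    """S of the least-squares line y = a + b*x (assumed n - 2 degrees of freedom)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return sqrt(sse / (n - 2))

def optimum_exponent(graphs, prop, k_grid):
    """Trial exponent that minimizes S for the single-variable model."""
    return min(k_grid,
               key=lambda k: standard_deviation_of_fit(
                   [chi(g, k) for g in graphs], prop))

# Small carbon skeletons used only to exercise the scan.
graphs = [
    [(1, 2), (2, 3), (3, 4)],                # n-butane
    [(1, 2), (2, 3), (2, 4)],                # isobutane
    [(1, 2), (2, 3), (3, 4), (4, 5)],        # n-pentane
    [(1, 2), (2, 3), (3, 4), (2, 5)],        # 2-methylbutane
    [(1, 2), (1, 3), (1, 4), (1, 5)],        # neopentane
]

# Synthetic "property": exactly linear in the index with k = -0.3.
synthetic_prop = [10.0 + 5.0 * chi(g, -0.3) for g in graphs]

k_grid = [i / 100 for i in range(-150, 0)]   # -1.50 ... -0.01, k = 0 excluded
best_k = optimum_exponent(graphs, synthetic_prop, k_grid)
```

Because the synthetic values were built from k = -0.3, the scan recovers that exponent; with real property data the minimum of S is generally shallower and the grid spacing limits the resolution of the optimum.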

Boiling Points of 18 Octanes. We first considered structure-boiling point models for isomeric octanes, on the basis of their vertex-connectivity indices computed for octane trees. The best model was obtained for k = -1.15. The regression equation is given by

$$\mathrm{bp} = 65.14(\pm 7.29) + 28.87(\pm 4.31)\,\chi^{[-1.15]} \tag{13}$$

n = 18, R = 0.859, S = 3.24, F = 45

where bp is the normal boiling point, R the correlation coefficient, S the standard deviation, F the Fisher ratio, and $\chi^{[-1.15]}$ a short-hand notation for the vertex-connectivity index computed using the value of -1.15 for the exponent in (7). The notation $\chi^{[k]}$ will be used throughout this paper. The improvement over the model based on k = -0.5 is rather slight:

$$\mathrm{bp} = 3.14(\pm 19.23) + 30.33(\pm 5.27)\,\chi^{[-0.50]} \tag{14}$$

n = 18, R = 0.821, S = 3.60, F = 33

The above models are identical to structure-boiling point models for octanes published elsewhere.3,24 Randic et al.25 have also observed that the modified vertex-connectivity index produces better structure-boiling point models of lower (C2-C7) alkanes than the standard version of the vertex-connectivity index. However, they have found that the exponent value of -0.33 leads to the best models of three alternatives they considered (k = -0.5, -0.33, -0.25).

The same analysis as with the vertex-connectivity index was also carried out with the edge-connectivity index. The best model was obtained for k = -0.30. The regression equation is given by

$$\mathrm{bp} = 179.75(\pm 11.12) - 13.66(\pm 2.30)\,\epsilon^{[-0.30]} \tag{15}$$

n = 18, R = 0.830, S = 3.52, F = 35

where $\epsilon^{[-0.30]}$ is a short-hand notation for the edge-connectivity index computed using the value of -0.30 for the exponent in (10). The notation $\epsilon^{[k]}$ will be used throughout this paper. The improvement over the model based on k = -0.5 is considerable

$$\mathrm{bp} = 162.76(\pm 29.94) - 14.52(\pm 8.85)\,\epsilon^{[-0.50]} \tag{16}$$

n = 18, R = 0.379, S = 5.84, F = 3

but  the  model  in  (15)  is  not  as  good  as  the  model  in  (13), 
though  it  is  somewhat  better  than  the  model  in  (14).  This 
result  supports  the  work  by  Estrada  and  Rodriguez,1  because 
one  of  the  two  physicochemical  properties  of  octanes  for 
which  the  use  of  the  edge-connectivity  index  did  not  produce 
the  best  single-variable  QSPR  model  was  the  boiling  point, 
the  other  being  the  heat  of  vaporization.  Estrada  and 
Rodriguez  pointed  out  that  to  describe  these  properties 
correctly  it  is  necessary  to  take  into  account  long-range 
contributions  in  the  edge-connectivity  index.9  In  both  these 
cases  better  single-variable  models  were  obtained  using  the 
Hosoya  Z  index.26 

Finally, we considered octane line graphs. Since $\chi^{[k]}(L) = \epsilon^{[k]}(G)$, we derived structure-boiling point models based on the edge-connectivity index $\epsilon^{[k]}(L)$. The best model was obtained for k = -0.675. The regression equation is given by

$$\mathrm{bp} = 167.56(\pm 9.03) - 20.17(\pm 3.37)\,\epsilon^{[-0.675]}(L) \tag{17}$$

n = 18, R = 0.831, S = 3.51, F = 36

where $\epsilon^{[-0.675]}(L)$ is a short-hand notation for the edge-connectivity index computed for a line graph using the value of -0.675 for the exponent in (10). This notation will be used throughout this paper when the models based on line graphs and edge-connectivity indices are discussed.

The  model  in  (17)  is  practically  the  same  as  the  model  in 
(15)  on  the  basis  of  octane  trees  and  the  edge-connectivity 
index.  The  improvement  over  the  model  based  on  k=  -0.5 
is  visible: 

$$\mathrm{bp} = 138.83(\pm 5.80) - 6.11(\pm 1.39)\,\epsilon^{[-0.50]}(L) \tag{18}$$

n = 18, R = 0.740, S = 4.24, F = 19

However,  this  model  is  much  better  than  the  corresponding 
model  in  (16)  on  the  basis  of  octane  trees. 
Boiling Points of 35 Nonanes. The same kind of analysis as in the case of modeling boiling points of octanes was carried out for nonanes. We first considered structure-boiling point models for isomeric nonanes, on the basis of their vertex-connectivity indices computed for nonane trees. The best model was obtained for k = -1.25. The regression equation is given by

$$\mathrm{bp} = 94.23(\pm 6.68) + 25.58(\pm 3.97)\,\chi^{[-1.25]} \tag{19}$$

n = 35, R = 0.746, S = 4.13, F = 41

The improvement over the model based on k = -0.5 is again rather slight:

$$\mathrm{bp} = 31.47(\pm 19.64) + 25.67(\pm 4.77)\,\chi^{[-0.50]} \tag{20}$$

n = 35, R = 0.683, S = 4.53, F = 29

The above models are comparable to the structure-boiling point models for nonanes published elsewhere.3,27 The same analysis was also carried out with the edge-connectivity index. The best model was obtained for k = -0.375. The corresponding regression equation is

$$\mathrm{bp} = 225.36(\pm 16.39) - 18.42(\pm 3.41)\,\epsilon^{[-0.375]} \tag{21}$$

n = 35, R = 0.685, S = 4.52, F = 29

This model and the model in (20) are practically the same. However, it is worse than the model in (19). The improvement over the model based on k = -0.5 is considerable:

$$\mathrm{bp} = 218.35(\pm 30.12) - 21.31(\pm 7.89)\,\epsilon^{[-0.50]} \tag{22}$$

n = 35, R = 0.426, S = 5.61, F = 7

Finally, we considered nonane line graphs. The best model was obtained for k = -0.70:

$$\mathrm{bp} = 203.18(\pm 9.60) - 22.96(\pm 3.32)\,\epsilon^{[-0.70]}(L) \tag{23}$$

n = 35, R = 0.769, S = 3.97, F = 48

This model is better than any regarding the relationship between structures and boiling points of nonanes. It represents an improvement over the model based on k = -0.5:

$$\mathrm{bp} = 161.56(\pm 5.94) - 22.96(\pm 3.32)\,\epsilon^{[-0.50]}(L) \tag{24}$$

n = 35, R = 0.587, S = 5.03, F = 17

Comparison between this model and the related models based on nonane trees shows that the model in (24) is not as good as the model in (20), but better than the model in (22).

In this example, the edge-connectivity index did live up to the expectations based on the work by Gutman et al.:2 The use of the line graph edge-connectivity index produced for nonanes the best structure-boiling point model. However, the model in (23) is still far from being satisfactory in comparison with models that use several topological indices.28 For example, the best structure-boiling point model for nonanes with five descriptors has R = 0.981 and S = 0.89.29

Gas Chromatographic Retention Indices of Alkanes. The same methodology as above was applied to the relationship between the structures of alkanes and their gas chromatographic retention indices.30 We first considered structure-chromatographic retention data correlation for the first 157 alkanes using as the structural parameter the vertex-connectivity index. The best correlation was obtained for k = -0.325:

$$\mathrm{RI} = 74.58(\pm 8.48) + 148.14(\pm 1.53)\,\chi^{[-0.325]} \tag{25}$$

n = 157, R = 0.992, S = 23.8, F = 9330

where RI stands for the retention indices of alkanes. This model gives a very good agreement between experimental and computed retention indices of alkanes. Retention indices of alkanes cover a range from RI(methane) = 100 to RI(2,3-dimethylundecane) = 1251.4. In most cases the difference between experimental and computed values is less than 3%.

The model in (25) is only slightly better than the model based on k = -0.5:

$$\mathrm{RI} = 64.92(\pm 9.38) + 187.97(\pm 2.13)\,\chi^{[-0.50]} \tag{26}$$

n = 157, R = 0.990, S = 26.0, F = 7801

The use of the edge-connectivity index produced poorer models:

$$\mathrm{RI} = 137.98(\pm 13.79) + 200.54(\pm 3.66)\,\epsilon^{[-0.55]} \tag{27}$$

n = 157, R = 0.975, S = 41.3, F = 3008

$$\mathrm{RI} = 134.0(\pm 14.55) + 184.89(\pm 3.54)\,\epsilon^{[-0.50]} \tag{28}$$

n = 157, R = 0.973, S = 43.2, F = 2729

These two models are comparable, but are much better than models based on alkane line graphs and their edge-connectivity indices:

$$\mathrm{RI} = 206.58(\pm 21.72) + 262.94(\pm 8.30)\,\epsilon^{[-0.775]}(L) \tag{29}$$

n = 157, R = 0.931, S = 68.2, F = 1003

$$\mathrm{RI} = 365.44(\pm 36.63) + 104.00(\pm 7.24)\,\epsilon^{[-0.50]}(L) \tag{30}$$

n = 157, R = 0.756, S = 122.2, F = 206

There are several structure-chromatographic retention index correlations for alkanes available in the literature.30 Most of them are based on the two-dimensional and three-dimensional Wiener numbers. However, there is also a correlation available based on the vertex-connectivity index with k = -0.5 which differs only slightly from (26):30

$$\mathrm{RI} = 69.81(\pm 9.31) + 186.93(\pm 2.11)\,\chi^{[-0.50]} \tag{31}$$

n = 157, R = 0.990, S = 26.0, F = 7827

The initial work on the structure-chromatographic retention data correlations is due to Randic.31 The correlations based on the two-dimensional ($^{2}W$) and three-dimensional ($^{3}W$) Wiener numbers, which are adjusted Walker-type correlations,32 are not as good as the model in (25):30

$$\mathrm{RI} = 171.2(\pm 15.7)\,(^{2}W)^{0.335(\pm 0.013)} - 48.6(\pm 27.3) \tag{32}$$

n = 157, R = 0.984, S = 33.0, F = 2403

$$\mathrm{RI} = 170.6(\pm 17.0)\,(^{3}W)^{0.325(\pm 0.013)} - 31.8(\pm 30.2) \tag{33}$$

n = 157, R = 0.982, S = 35.6, F = 2048
These models are, however, better than the ones based on edge-connectivity indices computed for either alkane trees or alkane line graphs. The best overall structure-chromatographic retention data correlation is obtained with the vertex-connectivity index with k = -0.325 (model in (25)). This is to our knowledge the best structure-chromatographic retention data model of alkanes that exists in the literature.

Molar Volumes of Alkanes. We considered molar volumes of 69 lower alkanes taken from Estrada.22 We first considered the structure-molar volume relationship using the vertex-connectivity index. The best correlation was obtained for a rather small value of k (-0.07). The regression equation and the statistical parameters for the correlation are:

$$\mathrm{MV} = 55.85(\pm 2.10) + 16.53(\pm 0.32)\,\chi^{[-0.07]} \tag{34}$$

n = 69, R = 0.988, S = 2.73, F = 2649

where MV stands for molar volume. This regression is better, as expected, than the one based on the standard value of k (-0.50):

$$\mathrm{MV} = 53.07(\pm 4.41) + 29.60(\pm 1.18)\,\chi^{[-0.50]} \tag{35}$$

n = 69, R = 0.951, S = 5.38, F = 632

These models are inferior to those based on the edge-connectivity index. The best structure-molar volume model was obtained for k = -0.515:

$$\mathrm{MV} = 57.44(\pm 1.37) + 31.80(\pm 0.41)\,\epsilon^{[-0.515]} \tag{36}$$

n = 69, R = 0.995, S = 1.81, F = 6094

This model is only very slightly better than the model based on the standard value of the exponent k:

$$\mathrm{MV} = 58.23(\pm 1.41) + 30.80(\pm 0.41)\,\epsilon^{[-0.50]} \tag{37}$$

n = 69, R = 0.994, S = 1.88, F = 5669

Equation 37 is different from the corresponding one given by Estrada22 as (1) in his paper. The difference is caused by the use of erroneous values of the edge-connectivity indices for six alkanes in Table 1 in Estrada's paper. The correct values are (we use the same codes for alkanes as Estrada): 33ME5, 3.1160; 233MMM5, 3.2832; 33ME6, 3.6766; 234MMM6, 3.7921; 244MMM6, 3.8432; and 334MMM6, 3.7107. The model in (37) is in fact better than the model given in Estrada's paper (statistical parameters for Estrada's structure-molar volume model with six incorrect values of edge-connectivity indices are R = 0.993, S = 2.034, and F = 4822).

The statistical characteristics of models based on the edge-connectivity index also support the work by Estrada and Rodriguez,1 because one of the five physicochemical properties of octanes for which the use of the edge-connectivity index produced the best single-variable QSPR model was the molar volume. This also agrees with analyses which point out that the edge-connectivity index is more appropriate than the vertex-connectivity index for modeling structure-molecular volume properties.

The structure-molar volume models based on line graphs and edge-connectivity indices possess statistical parameters that are rather inferior to those of the models shown above:

$$\mathrm{MV} = 111.14(\pm 5.76) + 12.45(\pm 1.35)\,\epsilon^{[-0.50]}(L) \tag{38}$$

n = 69, R = 0.748, S = 11.54, F = 85

$$\mathrm{MV} = 67.12(\pm 3.59) + 44.49(\pm 1.67)\,\epsilon^{[-0.775]}(L) \tag{39}$$

n = 69, R = 0.956, S = 5.09, F = 714

Molar Refractions of Alkanes. We considered molar refractions of 69 lower alkanes also taken from Estrada.22 Among the reported experimental values one is incorrect: Molar refraction of 34MM6 is 38.8453 instead of 43.6870.33 We first considered the structure-molar refraction relationship using the vertex-connectivity index. The best correlation was obtained again for a rather small value of k (-0.02). The regression equation and the statistical parameters for the correlation are

$$\mathrm{MR} = 6.99(\pm 0.15) + 4.70(\pm 0.02)\,\chi^{[-0.02]} \tag{40}$$

n = 69, R = 0.9993, S = 0.200, F = 46865

where MR is a short-hand notation for molar refraction. This regression equation is better than the one based on the standard value of k (-0.50):

$$\mathrm{MR} = 5.76(\pm 1.88) + 9.11(\pm 0.32)\,\chi^{[-0.50]} \tag{41}$$

n = 69, R = 0.962, S = 1.45, F = 824

The model in (40) is better than, and the model in (41) is worse than, the corresponding models based on the edge-connectivity index. The best structure-molar refraction model using edge-connectivity indices was obtained for k = -0.495:

$$\mathrm{MR} = 7.77(\pm 0.50) + 9.26(\pm 0.14)\,\epsilon^{[-0.495]} \tag{42}$$

n = 69, R = 0.992, S = 0.668, F = 4130

There is hardly any difference between this model and the model based on the standard value of exponent k:

$$\mathrm{MR} = 7.71(\pm 0.50) + 9.36(\pm 0.15)\,\epsilon^{[-0.50]} \tag{43}$$

n = 69, R = 0.992, S = 0.672, F = 4090

Equation 43 is different from the corresponding one given by Estrada22 as (2) in his paper. The difference is caused by erroneous values of the edge-connectivity indices for six alkanes (see the discussion above). The model in (43) is a little better than the model in the Estrada paper when the corrected values of the edge-connectivity indices are used. We also carried out the statistical analysis of Estrada's structure-molar refraction model with six incorrect values of edge-connectivity indices and obtained different statistical parameters (R = 0.983, S = 0.964, and F = 1969) from those reported (R = 0.9913, S = 0.698, and F = 3782).

The  model  in  (43),  being  better  than  the  model  in  (41), 
supports  the  claim  by  Estrada  and  Rodriguez1  regarding 
modeling  the  molar  refraction.  In  their  work  one  of  the  five 
physicochemical  properties  of  octanes  for  which  the  use  of 
the  edge-connectivity  index  produced  the  best  single-variable 
QSPR  model  was  also  the  molar  refraction.  However,  when 
the  models  based  on  vertex-  and  edge-connectivity  indices 
with  variable  exponents  are  considered,  the  reverse  is  true: 
the  model  in  (40)  is  better  than  the  model  in  (42).  The  model 
in  (40)  is  also  better  than  the  model  in  (43). 
Structure-molar refraction models based on alkane line graphs and edge-connectivity indices are again, as in the case of structure-molar volume models, inferior to models based on connectivity indices computed for alkane trees:

$$\mathrm{MR} = 11.78(\pm 1.29) + 12.30(\pm 0.56)\,\epsilon^{[-0.75]}(L) \tag{44}$$

n = 69, R = 0.936, S = 1.861, F = 474

$$\mathrm{MR} = 23.29(\pm 1.68) + 3.91(\pm 0.39)\,\epsilon^{[-0.50]}(L) \tag{45}$$

n = 69, R = 0.772, S = 3.358, F = 99

Model (40), in which the value of the exponent is rather low,34 supports the use of the structure-molecular refraction model based on the simplest possible topological index, the number of carbon atoms V:

$$\mathrm{MR} = 2.60(\pm 0.18) + 4.55(\pm 0.02)\,V \tag{46}$$

n = 69, R = 0.999, S = 0.208, F = 43200

CONCLUSIONS 

We investigated five structure-property models of alkanes. The properties considered were molecular surface-dependent properties (boiling points and gas chromatographic retention indices) and molecular volume-dependent properties (molar volumes and molar refractions). Alkanes were represented by trees and the corresponding line graphs. The vertex- and edge-connectivity indices were used as structural parameters. In each studied case we computed connectivity indices with an optimum exponent and with a standard value of -0.5. In total we generated six QSPR models for each property. The obtained results lead us to conclude the following.

(i) In all cases QSPR models based on connectivity indices with optimum exponents have better statistical parameters than the models based on connectivity indices with the standard value of the exponent (-0.5). This is fully in agreement with our earlier study3 and the ideas of Altenburg,35 Randic et al.,25 and Estrada.36 Therefore, we suggest that the modified versions of vertex- and edge-connectivity indices should be routinely employed in structure-property modeling rather than the standard versions of the connectivity indices.

(ii) In the five cases that we studied the structure-boiling point models for octanes and nonanes and the structure-chromatographic retention index model for alkanes based on vertex-connectivity indices are better than the corresponding models based on edge-connectivity indices. Thus, it appears that the vertex-connectivity index is more appropriate than the edge-connectivity index for modeling structure-molecular surface properties. Consequently, the vertex-connectivity index may be considered as a molecular surface descriptor.

(iii) In the five cases that we studied the structure-molar volume and the structure-molar refraction models for C5-C9 alkanes based on the edge-connectivity index produced the best single-variable model. This agrees with the findings of Estrada and Rodriguez1 and suggests that the edge-connectivity index is a better descriptor than the vertex-connectivity index for modeling structure-molecular volume properties. Thus, the edge-connectivity index may be regarded as a molecular volume descriptor. The edge-connectivity index appears to be a promising molecular descriptor,1,10,37-39 especially if the long-range contributions to this index are included in the modeling.9,11

(iv) The use of line graphs in this study did not improve the predictive power of the connectivity indices. Only in the case of structure-boiling point modeling for nonanes did the nonane line graphs produce the best model among the possibilities considered. Since the construction of the line graphs is not difficult and the computation of their descriptors can be easily carried out, it is also reasonable to use them in QSPR modeling, but more work is needed to establish the usefulness of the line graph model in structure-property studies.
Many chemicals are known to be toxic to living organisms, inducing mutations and deletions at the
chromosomal  and  genetic  level.  One  of  the  tasks  in  risk  assessment  of  genotoxic  chemicals  is  to  devise  a 
simple  numerical  descriptor  which  may  be  used  to  quantify  the  relationship  between  chemical  dose  and  the 
effect  on  the  genetic  sequences.  We  have  developed  numerical  descriptors  to  characterize  different  DNA 
sequences  which  are  especially  useful  in  sequence  comparisons.  These  descriptors  have  been  developed 
from  a  graphical  representational  technique  that  enables  easy  visualization  of  changes  in  base  distributions 
arising  from  evolutionary  or  other  effects.  In  this  paper  we  propose  a  scheme  to  use  these  descriptors  as  a 
label  to  help  quantify  the  potential  risk  hazard  of  chemicals  inducing  mutations  and  deletions  in  DNA 
sequences. 


INTRODUCTION 

The deleterious effects of many chemicals and newly synthesized compounds on human and environmental health are of serious concern. Many of these chemicals are known to pass through cell barriers and cause mutations and deletions in DNA. Recent studies have demonstrated how many common chemicals cause such effects: exposure to common environmental chemicals such as nitropyrenes present in diesel exhausts causes mutations and homologous recombinations in DNA leading to carcinogenesis;1,2 some polycyclic aromatic hydrocarbons from coal burning for industry and home heating form DNA adducts that have been shown to act as transplacental carcinogens and developmental toxicants3 or induce mutations at the GC and the AT base pairs of the hprt genes;4 other chemicals such as ethylnitrosourea and ethyl methanesulfonates have been shown to induce mostly transition types of mutations in DNA leading to chromosomal aberrations.5 Acrolein, a carbonyl compound present in the environment as a commonly used industrial chemical, natural product, environmental contaminant, and product of endogenous metabolism in human beings, has been found to cause mutations and intrastrand cross-links between guanine residues,6 and similar effects of many other compounds are known in the literature (see, e.g., refs 7 and 8). DNA damage is also induced by excesses of heavy metals such as Rh9 and Cu(II),10,11 which preferentially induce depletion of guanine residues. Table 1 gives a brief list of some of the data available in recent literature on effects of chemical substances on DNA sequences.

One  of  the  prime  tasks  in  risk  assessment  of  these  and 
other  chemicals  and  ions  is  to  define  one  or  more  numerical 
descriptors  of  the  chemical  dose  and  the  measured  effect. 
Much of the data to date, however, consist of measures of types of mutations and deletions observed in specific genes at various levels of chemical dosages, and much of it is order-of-magnitude indications of genetic risk.8 While some chemicals would induce mutations and deletions at sites with specific base pair combinations, others could lead to oxidative damages and mutations at random at intragenic and intergenic segments including point mutations and small deletions. Techniques of unbiased measures of such alterations in a DNA sequence from a set of numerical descriptors would be essential in assessing, in a universal and standard manner, the risk potential of such chemicals and form a vital link in integrating pharmacokinetics and mutational studies.

In  this  paper  we  outline  such  a  measure  arising  from 
descriptors  of  DNA  sequences  of  any  specified  length  and 
show  that  small  changes  due  to  random  point  mutations  or 
deletions  in  such  sequences  can  be  quantified  for  scaling 
purposes.  It  has  developed  out  of  a  technique  for  graphical 
representation  of  DNA  sequences  but  can  now  be  done 
rapidly  and  accurately  using  computer  programs  bypassing 
the  graphical  stage  altogether. 

METHOD 

The  fundamental  basis  of  our  proposed  quantitative 
descriptor  is  analysis  of  base  distribution  in  a  sequence  by 
taking  a  running  account  of  compositional  differences  in  pairs 
of  bases,  e.g.  intra-purines  and  intra-pyrimidines,  as  we  read 
down  the  sequence  from  the  5'-  to  the  3'-end.  This  is  most 
easily  visualized  in  terms  of  a  two-dimensional  graphical 
representation  described  below.  Since  the  method  depends 
on  small  differences  between  the  numbers  of  bases  present 
in  the  sequences,  it  is  very  sensitive  to  small  changes  in  base 
composition  and  distribution  patterns. 

Table 1. Effects of Different Chemicals on DNA Sequences (Recent Studies)a

| chemical | DNA sample | mutation composition (%) | refs and remarks |
|---|---|---|---|
| acrolein | SupF gene | deletions 24; transitions 21; transversions 55 (~GC to TA) | 4 |
| ethylnitrosourea | lacZ | deletions 5; transitions 43 (~GC to AT); transversions 52 (~AT to TA) | 5 |
| ethyl methanesulfonate | lacZ | deletions 8; transitions 74 (~GC to AT); transversions 18 (GC to TA) | 5 |
| heavy metals, Rh | oligomeric DNA duplexes | 100 (5'-G deleted in 5'-GG-3' doublets) | 9 (long-range electron transfer) |
| 5-nitroimidazoles | Bacteroides fragilis | 100 (majority C to G, CG to AT) | 7 |
| 1,3-butadiene | various, in mice, rat, humans | mutation data not available | 8; genetic hazard exists at permitted concns |
| polycyclic aromatic hydrocarbons | hprt gene | ~25; ~55 | 4 |

a Notes: The GC to AT* implies that the majority transitions are of the GC to AT type, etc. Acrolein is one of the α,β-unsaturated carbonyl compounds present in the environment. Nitroimidazoles, metronidazole and dimetridazole are used in treatment of intraabdominal, pulmonary, and brain abscesses and other diseases. 1,3-Butadiene is widely used in the petroleum industry.

The method of representing DNA sequences graphically using a two-dimensional Cartesian coordinate system has been explained elsewhere.12,13 The shapes of these DNA graphs depend on the base distribution in the sequence. The plot of a typical representation is generated by moving one step in the positive x-direction for a guanine (G) in the sequence, the negative x-direction for an adenine (A), the positive y-direction for a cytosine (C), and the negative y-direction for a thymine (T), the succession of such steps producing a graphical shape characteristic of the sequence. This essentially plots the progressive differences in the instantaneous individual totals of guanine and adenine along the x-axis (i.e., $n_G - n_A$) and of cytosine and thymine along the y-axis (i.e., $n_C - n_T$); two other sets of axes can be similarly defined for a complete representation, but we use the one described here as the default axes system. We have shown12 that for conserved genes such plots are similar in shape, thereby making identification of a new sequence of the gene family possible rapidly and easily by visual inspection alone; elsewhere we have shown that one can read off base preferences and local abundances directly from the shape of these graphs14 or identify coding and noncoding regions of the sequences.15 Changes in base distribution and composition induce changes in the visual plots of the DNA sequences; for the same genes for different species we have noticed systematic drifts in the sequence pattern which have been attributed to evolutionary changes.16
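The stepping rule for the two-dimensional representation translates directly into code. The sketch below is a minimal illustration: the function name `dna_walk` and the short test string are hypothetical, and we assume that one point is recorded after each base and that the origin itself is not stored as a point.

```python
# Unit step for each base in the default axes system:
# G -> +x, A -> -x, C -> +y, T -> -y
STEPS = {"G": (1, 0), "A": (-1, 0), "C": (0, 1), "T": (0, -1)}

def dna_walk(sequence):
    """Return the list of (x, y) points visited while reading the sequence 5' to 3'.

    After reading any prefix, x equals n_G - n_A and y equals n_C - n_T
    for that prefix.
    """
    x = y = 0
    points = []
    for base in sequence.upper():
        dx, dy = STEPS[base]   # raises KeyError for symbols other than G, A, C, T
        x += dx
        y += dy
        points.append((x, y))
    return points

toy_points = dna_walk("GGCAT")   # hypothetical five-base fragment
```

The last point of the walk gives the overall composition differences of the whole sequence, while the intermediate points carry the information about how the bases are distributed along it.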

Differences  in  the  plots  of  a  family  of  genes  can  be 
quantitatively  assessed.17  This  method  consists  essentially 
of  defining  a  set  of  moments  of  the  graph  points  around  the 
origin of the plot. In the first order we define quantities μ1(x)
and μ1(y), which are the sums of the x- and y-coordinate
values of each point averaged by the total number of points
in the distribution. One can then define a graph radius for
each plot

\[ g_R = [(\mu_1(x))^2 + (\mu_1(y))^2]^{1/2} \]

and correspondingly a distance measure between two graphs:

\[ d(s,s') = [(\mu_1(x) - \mu_1(x'))^2 + (\mu_1(y) - \mu_1(y'))^2]^{1/2} \]

where  s  and  s'  represent  the  two  graphs.  We  have  observed17 
that  small  differences  in  DNA  sequences  arising  out  of  base 
mutations  and  deletions  manifest  themselves  in  observable 
changes in gR and d. We propose to use gR as one
numerical descriptor of a sequence and the deviation from gR,
ΔgR, as a measure of the changes in a sequence as a
consequence of genotoxic effects of chemicals. For greater
precision, one could also use the set of μ1(x), μ1(y), and gR as
numerical descriptors of a DNA sequence.
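
The construction described above can be sketched in a few lines of Python. This is only an illustration, not the program used for the results reported here: the function names are hypothetical, and the sketch assumes that the graph points are the N positions reached after each base (the origin itself is not counted) and that the sequence contains only G, A, C, and T.

```python
import math

# Default axes from the text: G -> +x, A -> -x, C -> +y, T -> -y.
STEPS = {"G": (1, 0), "A": (-1, 0), "C": (0, 1), "T": (0, -1)}


def walk(seq):
    """Return the list of (x, y) points visited, one point per base."""
    x = y = 0
    points = []
    for base in seq.upper():
        dx, dy = STEPS[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def first_moments(seq):
    """mu1(x) and mu1(y): coordinate sums averaged over the number of points."""
    points = walk(seq)
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def graph_radius(seq):
    """gR = [mu1(x)^2 + mu1(y)^2]^(1/2)."""
    mx, my = first_moments(seq)
    return math.hypot(mx, my)


def graph_distance(seq_a, seq_b):
    """d(s, s') between the first moments of two sequences."""
    ax, ay = first_moments(seq_a)
    bx, by = first_moments(seq_b)
    return math.hypot(ax - bx, ay - by)
```

For the toy sequence GGCA the visited points are (1,0), (2,0), (2,1), and (1,1), so μ1(x) = 1.5 and μ1(y) = 0.5.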

RESULTS  AND  DISCUSSIONS 

As  a  preliminary  exploration  of  this  technique,  we  have 
used  the  complete  human  globin  gene  sequence  (from  the 
HSHBB  sequence  of  the  EMBL  DNA  database  rel  31), 
inclusive  of  the  introns  and  exons,  as  the  control  sequence. 
This  has  a  total  of  1424  bases  consisting  of  444  (31.2%) 
bases  in  the  coding  regions  and  980  (68.8%)  in  the  noncoding 
part. Plot 1 in Figure 1a shows the graphical representation
of  this  gene  starting  from  exon  1  through  introns  1  and  2  to 
exon  3.  Intron  1  is  G-rich  and  shows  a  horizontal  shift  to 
the  right;  intron  2  has  a  T-rich  part  in  the  initial  stages, 
represented  on  our  graph  as  an  almost  vertical  drop,  and  then 
a  long  stretch  of  TA  repeats  that  move  the  graph  generally 
in  a  southwesterly  direction  ending  with  exon  3  represented 
as  a  small  region  of  a  dense  cluster  of  points.  Exons  1  and 
2  are  also  represented  as  (less  dense)  clusters  of  points  unlike 
the  long  runs  of  the  introns;  we  have  elsewhere15  exploited 
this  characteristic  difference  between  intron  and  exon 
representations  as  a  means  for  determining  protein  coding 
regions  in  new  sequences. 

With  regard  to  the  problem  at  hand,  we  simulated  the 
effects  of  Rh  and  Cu(II)  toxicity  on  a  DNA  by  performing 
programmatically  random  deletions  of  several  guanines  in 
the  sample  sequence.  Such  deletions  will  tend  to  alter  the 
μ1(x) in the default representation with a bias toward negative
x-values (because of a higher percentage of adenosines in
the altered sequence) while leaving μ1(y) unchanged and
will consequently alter the graph radius. Graphically, the
reductions in the number of guanines will make the plot shift
to the left in the default reference frame, and the shift will
be greater for a greater degree of deletions effected. This is
evident visually from a low value of 5% deletions in the
complete sequence (Figure 1a). The values of ΔgR for different
numbers of guanine deletions are plotted in Figure 1b.

In  the  case  of  mutations,  the  graph  radius  is  quite  sensitive 
to  small  changes  and  to  specific  base  positions  affected.  A 
Figure 1. (a, top) Human β globin gene and its model modifications
plotted in the two-dimensional representation system. Axes
as explained in the text. Plot 1 is for the normal human β globin
gene complete with exons and introns. Plot 2 is for the same gene
with 5% random depletion in guanine residues. Plot 3 is the same
gene with 10% depletion in guanine bases. (b, bottom) Plot of
changes in graph radius (ΔgR) against guanine number for deletion
of guanines in positions 1-14.

mutation in the first position, reading from the 5'-end, effects
the maximum change while a mutation in the last base has
the least effect; this is easily understood from the fact that
the change in the first position alters the coordinate value of
each subsequent point all the way to the last base and thus
affects the value of gR much more than would be the case
for mutation of the last base. (The argument remains true
when read from the 3'-end, as long as one is consistent
in one’s convention; here we use the common convention
of reading from the 5'-end.) Figure 2 shows ΔgR plotted
against the guanine number for mutation of one guanine to
cytosine in each position of the guanine in the complete
sequence of the human β globin gene. It is interesting to
note that ΔgR has a unique value for each position, and, as
can be expected, the value goes down to almost zero for the
last guanine (the kink seen in the curve occurs at a large
gap between successive guanines). Mutations of guanine to
adenosine will produce smaller changes in ΔgR
since this is a change occurring exclusively in the x-direction
and leads to a contraction or expansion of the general curve,
whereas the previous mutations produced a change in
Figure 2. Plot of changes in graph radius (ΔgR) against the guanine
number for mutation (G to C) of a single guanine to cytosine at
various positions.

Figure 3. Plot of changes in graph radius (ΔgR) against the guanine
number for mutation (G to A) of a single guanine to adenosine at
various positions.

direction  of  the  plot  in  our  default  axes  system.  Figure  3 
shows the variation of ΔgR with guanine number for mutation
of a single guanine to adenosine. We have noted elsewhere18
that ΔgR can therefore be used as a quantitative descriptor
for indexing single nucleotide polymorphic genes.

In the present case of indexing as a measure of risk
assessment for toxicity, the sensitivity of ΔgR raises the
question of adequate knowledge of the exact location of the
toxic damage. Since any random mutation or deletion could
arise from the genotoxic effects, it would be preferable to
average over the entire range of values of ΔgR over the
chosen DNA segment to arrive at an acceptable index value
for purposes of comparative assessment. For example, for
the case of mutation of one guanine to adenosine, the average
value of ΔgR is 0.064 while that for the case of guanine to
cytosine is 0.537, and an index for the two types of causative
chemicals that produce just this level of mutation could be
written in thousandths as 64 or 537.
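
The averaging proposed above can be illustrated with a short Python sketch. It is not the code used to obtain the values quoted in the text; the function names are hypothetical, ΔgR is taken here as the absolute difference between the mutant and original graph radii (an assumption), and the graph points are assumed to be the positions reached after each base.

```python
import math

# Default axes from the text: G -> +x, A -> -x, C -> +y, T -> -y.
STEPS = {"G": (1, 0), "A": (-1, 0), "C": (0, 1), "T": (0, -1)}


def graph_radius(seq):
    """gR = [mu1(x)^2 + mu1(y)^2]^(1/2) over the points of the 2D walk."""
    x = y = sum_x = sum_y = 0
    for base in seq:
        dx, dy = STEPS[base]
        x += dx
        y += dy
        sum_x += x
        sum_y += y
    n = len(seq)
    return math.hypot(sum_x / n, sum_y / n)


def single_mutation_deltas(seq, old="G", new="C"):
    """Delta gR for each single old -> new substitution, read from the 5'-end."""
    reference = graph_radius(seq)
    deltas = []
    for pos, base in enumerate(seq):
        if base == old:
            mutant = seq[:pos] + new + seq[pos + 1:]
            deltas.append(abs(graph_radius(mutant) - reference))
    return deltas


def average_index(seq, old="G", new="C"):
    """Mean Delta gR over all single-base substitutions of the given type."""
    deltas = single_mutation_deltas(seq, old, new)
    return sum(deltas) / len(deltas)
```

For the three-base toy sequence GCG, mutating the first guanine to cytosine changes gR more than mutating the last one, in line with the positional trend described above.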

In the case of multiple base mutations, this trend of
different values of ΔgR for mutations at different base
positions also holds true: e.g., mutations of three guanines
to cytosines will cause maximum deviation from gR when
the mutations occur in the first three guanines (ΔgR =
2.78976 compared to the unmutated gene), and the change
will be least when the mutations take place in the last three
guanines (ΔgR = 0.03141 compared to the unmutated gene).
Multiple  mutations  will  therefore  create  a  field  of  values  for 
Figure 4. Plot of changes in graph radius (ΔgR) against the number
of guanines mutated for G to C mutations. The upper line is the
highest value and the bottom line the lowest value of ΔgR for a
given number of mutations.

Figure 5. Plot of changes in graph radius (ΔgR) against the number
of guanines mutated for G to A mutations. The upper line is the
highest value and the bottom line (not visible on this graph range)
the lowest value of ΔgR for a given number of mutations.

ΔgR, the maximum for a specific number of mutations being
the value realized from mutations in the first of those bases.
These maximum values will thus form an envelope as shown
in Figure 4, and a lower bound will be created by the
minimum values of ΔgR; all values between these two
boundaries  will  relate  to  the  different  bases  in  the  sequence 
that  can  be  mutated  for  any  specified  number  of  mutations. 
Figure 5 shows similar data for the various degrees of
possible G to A mutations.

While we have discussed these effects on the hypothesis
of G to C and G to A mutations, these results can be
generalized to mutations in any base combinations. For
example, in the case of genetic mutations induced by high
levels of toxic chemicals where more than one base can be
affected, e.g., mutations of the type GC to AT shown in
Figure 6, which occur in the case of the ethylnitrosourea
and ethyl methanesulfonate types of compounds, one can
determine the value of ΔgR from a sample sequence exposed
to a standard dosage and use that value as an index for
measuring the least number of mutations that could
give rise to such a value. From Figure 6, for example,
it can be seen that a ΔgR of 10 implies that the number of
corresponding mutations will be five GC doublets or more.

Thus an experimental measure of ΔgR for a given dose of
a toxic chemical can lead to association of an index value
Figure 6. Plot of changes in graph radius (ΔgR) against the number
of GC to AT mutations. The upper line is the highest value and
the bottom line the lowest value of ΔgR for a given number of
mutations.

that  will  permit  easy  gradation  of  chemicals  on  levels  of 
toxicity. Each toxin will affect a DNA in its own unique
way: some by deleting a preferred base, some by causing
random mutations in one or more preferred bases. The
usefulness of an index such as ΔgR arises from associating
one number with each dosage level of each chemical,
providing an easy path to associating risk with dosage
without having to enumerate which bases and how many are
mutated or deleted. ΔgR thus enables a normalization
approach to risk assessment of genotoxic chemicals where
no other such measure is readily available.

Note that the method is not dependent on the type of DNA
sequence used; while for some chemicals specific DNA
segments will be susceptible to damage, for others damage
can occur in any of the coding or noncoding segments, as,
for example, in the case of Cu(II)- and Rh-induced damage. The
indexing can be done for all these cases with respect to any
standard sequence segment chosen.

CONCLUSION 

Thus  we  see  that  the  concept  of  graph  radius  in  a  graphical 
representation  of  a  DNA  sequence  can  be  extended  to  make 
quantitative  estimation  of  any  changes  in  the  sequence.  This 
observation  indicates  that  it  is  possible  to  consider  using  such 
quantitation  as  an  index  of  the  intensity  of  the  effects  in  the 
case  of  changes  arising  out  of  effects  of  genotoxic  chemicals. 
As of now, however, we are restricted by the paucity of
experimental data to only indicating the use of ΔgR as a
possible index; experimental work so far is generally in
the nature of inquiries into the kinds of changes induced in
DNA sequences by genotoxic chemicals, whereas building
up a quantitative index would require controlled experiments
relating dosage and the extent of DNA damage.

Our work has shown that ΔgR, the change in gR, is a very
sensitive indicator of changes in a sequence arising out of
base depletions and mutations. It therefore provides a
numerical descriptor of the alterations in base distribution
and composition of DNA sequences and can be used for
comparison with any standard or control sequence. ΔgR,
therefore, averaged over its relevant range of values, can be
used as a numerical descriptor to provide a measure of the
genotoxic effects of chemicals such as the oxidants Rh
and Cu(II), acrolein, ethyl methanesulfonate, or any other
chemicals whose effect on DNA sequences can occur in a
random manner and therefore can affect any part of the DNA
whether coding or noncoding. In the case of genotoxins that
affect specific genes or base combinations, ΔgR will need
to be calculated for those specific genes only, and there the
sensitivity of the measure can be exploited to provide an
indicator of the genotoxicity level of the chemicals.
We calculated 202 molecular descriptors (topological indices, TIs) for two chemical databases (a set of 139
hydrocarbons  and  another  set  of  1037  diverse  chemicals).  Variable  cluster  analysis  of  these  TIs  grouped 
these  structures  into  14  clusters  for  the  first  set  and  18  clusters  for  the  second  set.  Correspondences  between 
the  same  TIs  in  the  two  sets  reveal  how  and  why  the  various  classes  of  TIs  are  mutually  related  and  provide 
insight  into  what  aspects  of  chemical  structure  they  are  expressing. 


INTRODUCTION 

A major part of the current research in mathematical
chemistry, chemical graph theory, and quantitative
structure-activity/property relationship studies involves topological
indices. Topological indices (TIs) are numerical graph
invariants that quantitatively characterize molecular structure.
A graph G = (V, E) is an ordered pair of two sets V and E,
the former representing a nonempty set and the latter
representing unordered pairs of elements of the set V. When
V represents the atoms of a molecule and elements of E
symbolize covalent bonds between pairs of atoms, then G
becomes a molecular graph (or constitutional graph, because
there is no stereochemical information). Such a graph depicts
the  topology  of  the  chemical  species.  A  graph  is  characterized 
using  graph  invariants.  An  invariant  may  be  a  polynomial, 
a  sequence  of  numbers,  or  a  single  number.  A  numerical 
graph  invariant  (i.e.,  a  single  number)  that  characterizes  the 
molecular  structure  is  called  a  topological  index. 

OVERVIEW  OF  TOPOLOGICAL  INDICES  USED  IN  THE 
PRESENT  STUDY 

A  large  number  of  topological  indices  have  been  defined 
and  used.1-11  The  majority  of  TIs  are  derived  from  the 
various  matrices  corresponding  to  molecular  graphs.  The 
adjacency  matrix  A(G)  and  the  distance  matrix  D(G)  of  the 
molecular  graph  G  have  been  most  widely  used  in  the 
formulation of TIs. Integer-number local vertex invariants
(LOVIs) are the vertex degrees (v_i) and the distance sums
(distasums, d_i) resulting from summation over rows or
columns of entries in the adjacency and distance matrices,
respectively. By mathematical operations performed on such
LOVIs, one can obtain a molecular descriptor, i.e., a
topological index. Wiener’s index W (eq 1),2 the Zagreb
group index M1 (eq 2),!1 Randic’s connectivity index, χ (eq
3),4 the higher-order connectivity indices, ^nχ, for paths of
length n defined by Kier and Hall,5 and the J index (eq 4)6
fall in this category.


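\[ W = \tfrac{1}{2} \sum_i d_i \tag{1} \]

\[ M_1 = \sum_i v_i^2 \tag{2} \]

\[ \chi = \sum_{ij} (v_i v_j)^{-1/2} \tag{3} \]

\[ J = \frac{q}{p + 1} \sum_{ij} (d_i d_j)^{-1/2} \tag{4} \]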

The  summations  in  formulas  3  and  4  are  over  all  edges 
i-j  in  the  hydrogen-depleted  graph.  The  numbers  q  of  graph 
edges  and  p  of  cycles  in  the  graph  are  introduced  into 
formula  4  in  order  to  avoid  the  automatic  increase  of  J  with 
graph size and cyclicity. Indeed, for an infinite linear carbon
chain it was demonstrated that J = π = 3.14159. The nature
of atoms can be taken into account by means of parameters
based on their relative atomic numbers, electronegativities,
or covalent radii, with respect to those of carbon atoms,
multiplying the corresponding distasum in formula 4 for J.
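
As an illustration of how these four indices follow from the LOVIs, the Python sketch below computes them for a connected hydrogen-depleted graph given as an edge list. It assumes the usual forms of the definitions: W as half the sum of the distance sums, M1 as the sum of squared vertex degrees, χ as the sum of (v_i v_j)^(-1/2) over edges, and J as q/(p + 1) times the sum of (d_i d_j)^(-1/2) over edges, with p taken as the cyclomatic number q - n + 1. The function names are hypothetical, and no heteroatom weighting is applied.

```python
from collections import deque


def distance_matrix(n_atoms, edges):
    """Adjacency lists and topological distance matrix via breadth-first search."""
    adjacency = [[] for _ in range(n_atoms)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    distances = []
    for start in range(n_atoms):
        row = [-1] * n_atoms
        row[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if row[w] < 0:
                    row[w] = row[u] + 1
                    queue.append(w)
        distances.append(row)
    return adjacency, distances


def topological_indices(n_atoms, edges):
    """Return W, M1, chi, and J for a connected hydrogen-depleted graph."""
    adjacency, distances = distance_matrix(n_atoms, edges)
    degree = [len(neighbors) for neighbors in adjacency]   # v_i
    dist_sum = [sum(row) for row in distances]             # d_i (distasums)
    q = len(edges)
    p = q - n_atoms + 1                                    # number of cycles
    return {
        "W": sum(dist_sum) / 2,
        "M1": sum(v * v for v in degree),
        "chi": sum((degree[i] * degree[j]) ** -0.5 for i, j in edges),
        "J": q / (p + 1) * sum((dist_sum[i] * dist_sum[j]) ** -0.5
                               for i, j in edges),
    }


# n-Butane as a four-vertex path: C0-C1-C2-C3.
butane = topological_indices(4, [(0, 1), (1, 2), (2, 3)])
```

For n-butane the distance sums are 6, 4, 4, and 6, which gives W = 10 and J of about 1.9747.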

The mean-square-root distance D derived from all topological
distances (denoted by i in the next formula) is defined
as6b

\[ D = \left[ \left( \sum_i i^2 \right) / \left( \sum_i i \right) \right]^{1/2} \tag{5} \]

For taking into account the chemical nature of atoms
symbolized by vertices, Kier and Hall advocated the use of
“valence connectivity indices”.5a,b These are calculated with
formulas similar to Randic’s (eq 3), but the products of edge
end point (or path vertex) invariants are no longer of vertex
degrees but of weights (valence delta values δ_i) given by
formula 6

\[ \delta_i = (Z_i^v - H_i)/(Z_i - Z_i^v - 1) \tag{6} \]

where Z_i^v stands for the number of valence electrons in atom
i, Z_i is its atomic number, and H_i is the number of hydrogen
atoms attached to atom i.
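
A small worked example may help here. The Python sketch below evaluates the valence delta value from the three quantities named above (valence electrons, atomic number, attached hydrogens); the function name and the example atoms are illustrative choices, not taken from the study.

```python
def valence_delta(atomic_number, valence_electrons, hydrogens):
    """Kier-Hall valence delta: (Zv - H) / (Z - Zv - 1)."""
    return (valence_electrons - hydrogens) / (
        atomic_number - valence_electrons - 1
    )


# Methyl carbon (Z = 6, Zv = 4, H = 3), hydroxyl oxygen (Z = 8, Zv = 6, H = 1),
# and a chlorine substituent (Z = 17, Zv = 7, H = 0).
methyl_carbon = valence_delta(6, 4, 3)
hydroxyl_oxygen = valence_delta(8, 6, 1)
chlorine = valence_delta(17, 7, 0)
```

For second-row atoms the denominator equals 1, so the valence delta reduces to the number of valence electrons minus the number of attached hydrogens; for chlorine the denominator is 9, which lowers the weight to 7/9.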

The most recent additions to the Kier-Hall armamentarium
of TIs are electrotopological state indices.5c

Another class of molecular descriptors, the
information-theoretic indices, are derived from an entirely different
reasoning. In this case, the complexity or mode of partitioning
of structural features is decomposed into disjoint subsets
using an equivalence relation; a molecular complexity index
is then computed using Shannon’s idea of information
content or complexity.12 Real-number local vertex invariants
(LOVIs), on the other hand, may also be defined starting
from matrices other than A(G) or D(G) or by
applying information theory at the vertex level. Thus,
topological indices U, V, X, and Y were defined.13 Bonchev
and Trinajstic described several information-theoretic TIs
reviewed thoroughly in Bonchev’s book.7

The  information-theoretic  indices  developed  by  Basak  and 
co-workers  take  into  account  all  atoms  in  the  constitutional 
formula  (hydrogens  also  being  included),  and  one  considers 
the  information  content  provided  by  various  classes  of  atoms 
based  on  their  topological  neighborhood.  There  are  three 
main  types  of  informational  indices  developed  by  Basak  et 
al.:  IC  (mean  information  content  or  complexity  of  a 
hydrogen-filled  graph,  with  vertices  grouped  into  equivalence 
classes  having  r  vertices;  the  equivalence  is  based  on  the 
nature  of  atoms  and  bonds,  in  successive  neighborhood 
groups);  CIC  (complementary  information  content);  and  SIC 
(structural information content), and they are not
intercorrelated with other TIs. In the following formula, the
summation spans the range from i = 1 to i = r:


\[ IC_r = -\sum_i p_i \log_2 p_i \tag{10} \]

\[ SIC_r = IC_r / \log_2 N \tag{11} \]

\[ CIC_r = \log_2 N - IC_r \tag{12} \]

The probability that a randomly selected vertex occurs in
the ith equivalence class is denoted by p_i. The ICr, SICr,
and CICr indices can be calculated for different orders of
neighborhoods, r (r = 0, 1, 2, ..., ρ), where ρ is the radius
of the molecular graph G. At the 0th-order level, the atom
set is partitioned solely on the basis of its chemical nature;
at the level of the first-order topological neighborhood, the
atoms are partitioned into disjoint subsets on the basis of
their chemical nature and their first-order bonding topology.
At the next level, the atom set is decomposed into equivalence
classes using their chemical nature and bonding pattern
up to the second-order bonded neighbors. The process is
continued  until  consideration  of  higher-order  neighbors  does 
not  yield  further  increase  in  the  number  or  composition  of 
disjoint  subsets. 
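
Once the atoms have been assigned to equivalence classes, the three indices follow directly from the class sizes. The Python sketch below shows that final step only; it takes the class label of every atom as input and does not perform the neighborhood partitioning itself. The function name is hypothetical, and N is assumed to be the total number of vertices in the hydrogen-filled graph.

```python
import math
from collections import Counter


def information_indices(class_labels):
    """Return (IC, SIC, CIC) from one equivalence-class label per atom."""
    n = len(class_labels)
    counts = Counter(class_labels)
    # p_i = fraction of vertices falling in class i.
    ic = -sum((c / n) * math.log2(c / n) for c in counts.values())
    log_n = math.log2(n)
    sic = ic / log_n if n > 1 else 0.0
    cic = log_n - ic
    return ic, sic, cic


# Zeroth-order partition of methane (hydrogen-filled): one C and four H.
ic0, sic0, cic0 = information_indices(["C", "H", "H", "H", "H"])
```

For methane at order 0 the class probabilities are 1/5 and 4/5, giving IC0 of about 0.722 and CIC0 = 1.6.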

A large variety of real-number local vertex invariants, and
thence a larger variety of TIs, were described on the basis
of converting a matrix (A or D for instance) into a system
of linear equations. This is done by means of two column
vectors that can convey topological, chemical, or numerical
information. One nonzero vector is the free term of the
system of equations. The other one (which may be zero, but
this restricts the choices on available supplementary information)
becomes the main diagonal of the matrix (if both vectors
were zero, then some negative LOVIs would result with
difficulties of interpretation). These vectors may be the
following integers: Z (atomic number of the atom corresponding
to each vertex), V (vertex degree), 1 (identity), N
(number of non-hydrogen atoms, or order of the graph), N^k
(power k of N). Less frequently, one may use for periodicity
of chemical properties real numbers: S (electronegativity)
or R (covalent radius) of the atom corresponding to each
vertex. The resulting matrix with the vector for the main
diagonal constitutes the set of coefficients for the N
unknowns that represent the real-number LOVIs of the N
vertices. The triplet (matrix, vector for the main diagonal
and vector for the free term) also serves as notation for
LOVIs and for the derived TIs. After the system of N linear
equations is solved, the LOVIs (x_i) are assembled into a
“triplet TI” based on one of the following operations:

1. summation, Σ_i x_i;

2. summation of squares, Σ_i x_i^2;

3. summation of square roots, Σ_i x_i^{1/2};

4. sum of inverse square root of cross-product over edges
ij, Σ_ij (x_i x_j)^{-1/2};

5. product, N[Π_i x_i]^{1/N}.

Numbers  1-5  of  the  above  operations  after  the  triplet 
complete  the  notation  of  the  triplet  TIs.14 
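
The triplet construction can be sketched as follows in Python. This is an illustration under stated assumptions, not the in-house software used in this study: the function names are hypothetical, the main-diagonal vector simply overwrites the zero diagonal of the chosen matrix, and the system is solved by plain Gaussian elimination.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def triplet_lovis(matrix, diagonal, free_term):
    """LOVIs x_i from the triplet (matrix, main-diagonal vector, free term)."""
    a = [list(row) for row in matrix]
    for i, d in enumerate(diagonal):
        a[i][i] = d
    return solve_linear(a, free_term)


def triplet_ti(x, edges, operation):
    """Combine LOVIs into a TI using one of the five operations listed above."""
    n = len(x)
    if operation == 1:
        return sum(x)
    if operation == 2:
        return sum(v * v for v in x)
    if operation == 3:
        return sum(math.sqrt(v) for v in x)
    if operation == 4:
        return sum((x[i] * x[j]) ** -0.5 for i, j in edges)
    if operation == 5:
        return n * math.prod(x) ** (1.0 / n)
    raise ValueError("operation must be 1-5")


# Triplet AZV for propane (C-C-C): adjacency matrix, Z = 6 on the diagonal,
# vertex degrees as the free term.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lovis = triplet_lovis(adjacency, [6, 6, 6], [1, 2, 1])
azv1 = triplet_ti(lovis, [(0, 1), (1, 2)], 1)
```

For propane the system gives x = (2/17, 5/17, 2/17), so the index obtained with operation 1 is 9/17.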

To  conclude  this  brief  review  of  TIs,  one  should  mention 
recent  progress  that  includes  other  matrices  such  as  the 
reciprocal  distance  matrix  that  yields  Harary  indices,15  the 
regressive  distance  matrices,16  the  Szeged  matrix,17  and  the 
resistance  distance  matrix  that  affords  Kirchhoff  indices.18 
So-called  optimal  structural  descriptors  can  be  obtained  from 
some  TIs  by  varying  some  parameters  and  thereby  adapting 
them to the database;19 alternatively, in Randic-type formulas
(eqs 3, 4) the exponent is allowed20 to differ from 1/2.
Three-dimensional molecular descriptors can be derived from
geometrical and topological structural features of molecules.21

Each of the indices discussed above is a “global” parameter;
i.e., it quantifies certain aspects of the entire molecular
structure using a single number.

It  is  clear  from  the  above  discussion  that  the  set  of  TIs  is 
a  group  of  heterogeneous  entities.  They  have  been  defined 
to  characterize  molecular  structure  on  the  basis  of  distinct 
objectives and motivations. Despite their distinctive characteristics,
TIs share certain common features. A topological
index  maps  a  set  of  chemicals  C  into  the  set  R  of  real  or 
integer  numbers.  Therefore,  TIs  quantify  some  general 
aspects  of  molecular  architecture  such  as  size,  shape, 
symmetry,  bonding  type,  cyclicity,  branching  pattern,  etc. 

Topological indices have been used for isomer discrimination,
quantification of the structural similarity/dissimilarity
of  molecules,  and  prediction  of  property/activity  from 
structure.19  The  widespread  use  of  TIs  obviously  encourages 
one  to  ask  some  fundamental  questions  about  them:  What 
is  the  fundamental  nature  of  TIs?  To  what  degree  are  they 
intercorrelated?  How  does  one  extract  orthogonal  information 
from  TIs? 

The  intercorrelation  of  TIs  was  studied  earlier  with  a 
limited  set  of  invariants.  Thus,  Motoc  and  Balaban22 
described  graphically  the  intercorrelations  of  the  few  TIs 
known  until  1981.  These  aspects  were  reviewed  in  the  early 
1980s.23  Basak  et  al.  studied  the  mutual  relatedness  of  a  set 
of  90  TIs  calculated  for  a  set  of  3692  diverse  chemicals.24 
A  third  study  by  Todeschini  et  al.  will  be  discussed  in  the 
last  section  of  this  paper. 

All  such  studies  were  limited  in  the  sense  that  they 
analyzed  data  on  a  smaller  and  less  diverse  group  of  TIs. 
Therefore,  in  this  paper,  we  have  studied  the  mutual 

Table 1. Summary of Chemical Classes or Features in Databases Analyzed

chemical classes or features                  database A       database B
                                              (hydrocarbons)   (diverse)

total number of compounds                     139              1037
hydrocarbons                                  139              565
alkanes, cyclic alkanes                       73               206
aromatics                                     66               288
alkyl benzenes                                29               80
fused rings                                   37               56
polycyclic aromatics                          37               49
non-hydrocarbons                              0                472
halogen-containing compounds                                   359
heteroatom-containing compounds                                101
  (sulfur or phosphorus)
compounds containing both                                      12
  halogens and heteroatoms
organosulfides                                                 105
organophosphorus                                               8


relatedness of a set of 202 TIs. We have also tried to extract
useful and orthogonal structural information from the
calculated TIs. This study also reports, for the first time, a
comprehensive discussion of Basak’s information content
indices (ICr, SICr, CICr), the triplet indices (proposed by one
of the present authors), and Balaban’s average distance-based
connectivity index J as compared to the traditional and more
widely used indices.

The goal of this paper is two-fold: (a) to study the degree
of intercorrelation among the various types of topological
indices and (b) to extract mutually uncorrelated (orthogonal)
topological parameters that can be used for QSAR/QSPR
studies, quantitation of intermolecular similarity/dissimilarity,
and characterization of real and virtual combinatorial libraries.
To this end, we studied the mutual relatedness of a set
of more than 200 topological indices in this paper.

METHODS 

Chemical  Databases.  There  were  two  sets  of  chemicals 
analyzed  in  this  study:  a  set  of  139  hydrocarbons  to  represent 
a  moderately  homogeneous  set  of  chemicals  and  a  set  of 
1037  diverse  chemicals.  The  hydrocarbons  consisted  of  73 
C3-C9  alkanes,  29  alkylbenzenes,  and  37  polycyclic 
aromatic  hydrocarbons.25  The  diverse  set  of  1037  compounds 
consists  of  those  chemicals  from  the  U.S.  EPA  ASTER 
system26  for  which  a  measured  boiling  point  was  available 
and  hydrogen-bonding  potential  (as  measured  by  HB1  =  0) 
did  not  exist.  The  composition  of  these  data  sets  is  indicated 
in  Table  1.  Table  2  presents  the  list  of  all  202  parameters 
calculated  in  this  study. 

Calculation of TIs. The TIs calculated for this study
(some of which are included in Table 2) include Wiener
number W,2 molecular connectivity indices as calculated by
Randic4 and Kier and Hall,5 frequency of path lengths of
varying size,5 information-theoretic indices defined on
distance matrices of graphs using the methods of Bonchev
and Trinajstic,7 Roy et al.,27 Basak et al.,28-31 and
Raychaudhury et al.,32 parameters defined on the neighborhood
complexity of vertices in hydrogen-filled molecular graphs,28-32


Table 2. Symbols and Definitions of Topological Parameters

index      definition

I_D^W      information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W      mean information index for the magnitude of distance
W          Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D        degree complexity
H^V        graph vertex complexity
H^D        graph distance complexity
IC         information content of the distance matrix partitioned by frequency of occurrences of distance h
O          order of neighborhood when ICr reaches its maximum value for the hydrogen-filled graph
I_ORB      information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
M1         a Zagreb group parameter = sum of square of degree over all vertices
M2         a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
ICr        mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SICr       structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CICr       complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
^hχ        path connectivity index of order h = 0-6
^hχ_C      cluster connectivity index of order h = 3-6
^hχ_PC     path-cluster connectivity index of order h = 4-6
^hχ_Ch     chain connectivity index of order h = 3-6
^hχ^b      bond path connectivity index of order h = 0-6
^hχ^b_C    bond cluster connectivity index of order h = 3-6
^hχ^b_Ch   bond chain connectivity index of order h = 3-6
^hχ^b_PC   bond path-cluster connectivity index of order h = 4-6
^hχ^v      valence path connectivity index of order h = 0-6
^hχ^v_C    valence cluster connectivity index of order h = 3-6
^hχ^v_Ch   valence chain connectivity index of order h = 3-6
^hχ^v_PC   valence path-cluster connectivity index of order h = 4-6
P_h        number of paths of length h = 0-10
J          Balaban’s J index based on distance
J^B        Balaban’s J index based on bond types
J^X        Balaban’s J index based on relative electronegativities
J^Y        Balaban’s J index based on relative covalent radii
triplet    Global invariants based on solutions of linear equation systems using the adjacency matrix (A), distance matrix (D), and column/row
           vectors: distance sums (S), atomic number (Z), number of non-hydrogen atoms (N and N^2), vertex degree (V), or numerical constants (1).
           Notation is described by triplets (e.g., AZV). Results are weightings for each atom in a molecule. These weights are combined by five
           possible formulas: 1 = sum of weights, Σ_i x_i; 2 = sum of squared weights, Σ_i x_i^2; 3 = sum of square root of weights, Σ_i x_i^{1/2};
           4 = sum of cross-products, Σ_ij (x_i x_j)^{-1/2}; and 5 = product of weights, N[Π_i x_i]^{1/N}.

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and  Balaban’s  J  indices6  as  well  as  triplet  indices.14  The 
majority  of  the  TIs  were  calculated  using  the  program 
POLLY  2. 3. 33  The  J  indices  and  triplet  indices  were 
calculated  using  software  developed  in-house  by  the  authors. 

STATISTICAL  ANALYSIS 

For  both  sets  of  chemicals,  the  computed  TIs  were 
transformed  by  the  natural  logarithm  of  the  index  plus  a 
constant,  generally  1.  This  was  done  since  the  scale  of  some 
indices  may  be  several  orders  of  magnitude  greater  than  that 
of  other  indices. 
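The scaling step described above can be sketched in a few lines. This is a minimal illustration, assuming the constant is 1; the function name and the raw values are hypothetical and are not taken from the study.

```python
import math

def log_transform(values, constant=1.0):
    """Return ln(x + constant) for each raw index value x."""
    return [math.log(x + constant) for x in values]

# Hypothetical raw index values spanning several orders of magnitude.
raw = [0.0, 9.0, 999.0, 99999.0]
scaled = log_transform(raw)
```

After the transformation the values lie on a comparable scale, so that no single index dominates a subsequent correlation-based analysis merely because of its magnitude.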

For each set, a technique known as variable clustering was
performed using the SAS procedure VARCLUS.34 The
variable-clustering procedure divides the set of indices into
disjoint clusters, such that each cluster is essentially
unidimensional. This is accomplished by a repeated
principal-components analysis of the sets of indices. The initial
principal-component analysis examines all indices and
defines two principal components or eigenvectors. If the
eigenvalue for the second component is >1.0, the indices
are split into separate clusters by correlating the indices with
the first and second principal components. Those indices
most correlated with the first component form one cluster
and those indices most correlated with the second component
form another cluster, thus forming two disjoint clusters. A
principal-component analysis is then performed for each
cluster of indices, with the cluster being split if the eigenvalue
for the second component is >1.0. The procedure is repeated
until the second eigenvalue is <1.0 for all clusters.
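The splitting rule described above can be illustrated with a simplified sketch. It is not the SAS VARCLUS implementation: the rotation and iterative reassignment steps that VARCLUS applies are omitted, NumPy is assumed to be available, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def top_two_components(X):
    """Two largest eigenvalues of the correlation matrix of X and the
    correlations (loadings) of each column with the two components."""
    R = np.corrcoef(X, rowvar=False)
    evals, evecs = np.linalg.eigh(R)               # ascending order
    evals, evecs = evals[::-1][:2], evecs[:, ::-1][:, :2]
    loadings = evecs * np.sqrt(np.clip(evals, 0.0, None))
    return evals, loadings

def variable_clusters(X, threshold=1.0):
    """Split columns of X until every cluster has a second eigenvalue <= threshold."""
    pending = [list(range(X.shape[1]))]
    done = []
    while pending:
        cols = pending.pop()
        if len(cols) < 2:
            done.append(cols)
            continue
        evals, loadings = top_two_components(X[:, cols])
        if evals[1] <= threshold:
            done.append(cols)
            continue
        # Assign each index to the component with which it is most correlated.
        to_first = np.abs(loadings[:, 0]) >= np.abs(loadings[:, 1])
        first = [c for c, m in zip(cols, to_first) if m]
        second = [c for c, m in zip(cols, to_first) if not m]
        if not first or not second:
            done.append(cols)
            continue
        pending += [first, second]
    return sorted(done)

# Synthetic example: four columns driven by one latent factor, two by another.
rng = np.random.default_rng(0)
size, shape = rng.normal(size=(2, 500))
noise = 0.1 * rng.normal(size=(500, 6))
X = np.column_stack([size, size, size, size, shape, shape]) + noise
clusters = variable_clusters(X)
```

Because no rotation is applied, this sketch separates groups cleanly only when the leading eigenvalues are well separated, as in the synthetic example; constant columns must be removed beforehand because their correlations are undefined.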

RESULTS  AND  DISCUSSION 

The first database (denoted by A) consists of 139
hydrocarbons (alkanes, alkylbenzenes, and polycyclic
aromatics) and 162 TIs. The number of indices examined was
reduced from the original 202 by removing all but one of
the degenerate (i.e., correlation of 1.0) indices and those
indices that were constant (0.0) for all chemicals. The second
database (denoted by B) is a diverse one and contains 1037
chemical structures and 176 nondegenerate, nonconstant
indices.
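The removal of constant and degenerate indices can be sketched as follows. This is an illustrative filter under stated assumptions: the tolerance, the function names, and the small descriptor table are hypothetical, and the first member of each perfectly correlated group is the one retained.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(table, tol=1e-9):
    """Keep one index per perfectly correlated group; drop constant indices."""
    kept = []
    for name, values in table.items():
        if max(values) - min(values) <= tol:          # constant for all chemicals
            continue
        if any(pearson(values, table[k]) >= 1.0 - tol for k in kept):
            continue                                   # degenerate with a kept index
        kept.append(name)
    return kept

# Hypothetical descriptor table: one index per key, one value per chemical.
table = {
    "P0": [2.0, 3.0, 4.0, 5.0],
    "twice_P0": [4.0, 6.0, 8.0, 10.0],   # correlation of 1.0 with P0
    "ring_count": [0.0, 0.0, 0.0, 0.0],  # constant
    "J": [1.0, 1.6, 1.9, 2.1],
}
kept = drop_redundant(table)
```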

The results of the variable-cluster analysis will be presented,
first discussing how the descriptors (variables) for
database A become clustered and then surveying the descriptor
clustering for database B, as well as the correspondence
between these clusters. Intercluster correlation will then be
described.

The  clusters  have  been  ordered  according  to  decreasing 
numbers  of  descriptors  in  each  cluster;  when  clusters  contain 
the  same  number  of  descriptors,  the  numbering  of  the 
corresponding  clusters  is  arbitrary. 

In Figure 1, the points on the left-hand side denote the clusters
that group together the descriptors for the hydrocarbon database A,
and those on the right-hand side denote the clusters for the diverse
database B. Each cluster is denoted by a letter (A or B) and a number.
The total number of variables in each cluster is written under
each point. Full lines connect A-type with B-type clusters,
having inscribed on them the numbers of descriptors common
to each pair of clusters; when no number is inscribed, this
indicates a single common descriptor. Dashed side lines
denote the descriptors that do not have counterparts in the
other set of clusters, and the associated numbers on these
side lines indicate the numbers of such “orphan” descriptors.
Because the two data sets differ both in the numbers of
compounds and in their structures, it is normal to expect that
clusters for one data set will have counterparts in several
clusters in the other data set. This is indeed what was found,
as shown below in the analysis of the diverse data set.

Figure 1. Associations between clusters of descriptors for the
hydrocarbon database (A-type clusters) and the database with
diverse compounds (B-type clusters). Solid lines connect A-type
descriptors with B-type descriptors, and the numbers of common
descriptors are indicated on such lines (when no number is indicated,
there is just one common descriptor). Dashed lateral lines indicate
descriptors that have no correspondence for the other type.

Only in a single case have we found a one-to-one
correspondence between clusters of descriptors corresponding
to the two data sets (A12 and B14). Nevertheless, in several
instances (A6, A11; B4, B9, B15, B16, and B17), a cluster
for one data set (say, A) was found to have all its descriptors
in common with only one cluster of the other data set (say,
B); however, this latter cluster also contains descriptors found
in more than one cluster of the other set.

Clustering of Descriptors for Hydrocarbons. The descriptors
for database A are grouped in 14 clusters summarized
in Table 3. Cluster A1 contains 54 of the 162
descriptors; therefore, it groups together about one-third of
all variables. These descriptors depend on both the shape
and the size (magnitude) of the molecular graph; such
descriptors include the Randic connectivity index, the Kier-Hall
simple path connectivity indices, the Zagreb group
indices, and many triplet indices having as the main diagonal
column vector the atomic number Z or the total number N
of vertices.

Table 3. Summary of Variable Clustering for 139 Hydrocarbons

cluster  number of variables  representative variables (max. 25% of total listed)
A1  54  DN2Z4, DN2N„, P0, AZV4, ASZ4, ANNj, ANNS, AZNj
A2  19  6*, Pi, 3X, V, V
A3  13  y, y, ANzi
A4  13  SIC6, SIC5, IC6
A5  12  DSZj, DSZ5, ASZj
A6  9  DSZ3, DSN5
A7  9  DSN3, DN2N,
A8  6  Yc, Vc
A9  6  DSZ2, ASZ2
A10  5  SICi
A11  4  CIC,
A12  4  Yc
A13  4  SIC3
A14  4  5Zch

Cluster A2, with 19 descriptors (about 12% of the total),
includes molecular connectivity indices of order higher than
5, the J indices, and two closely similar triplet indices. Cluster
A3 contains mainly valence/bond-corrected molecular
connectivity indices. The next cluster, A4, consists mainly of
the information-based indices IC (information content), SIC
(structural information content), and CIC (complementary
information content) for the hydrogen-filled graphs of order
higher than 2 for IC and higher than 3 for SIC and CIC.
Cluster A5 is composed mainly of triplet indices having as
main diagonal unit vectors either distance sums or total
number N of vertices.

Each of the remaining clusters has fewer than 10 descriptors.
Clusters A6 and A7 contain mostly triplet descriptors: A6
with the distance sum S and A7 with the order N of the
hydrogen-depleted graph, as the main diagonal unit vector;
cluster A7 also includes two simple path cluster molecular
connectivity indices. Cluster A8 contains simple cluster- and
bond/valence-corrected cluster connectivities of high order
(4-6). Cluster A9 again consists exclusively of triplet
indices, and they are based on summing squares of LOVIs
based mainly on distance sum unit vectors on the main
diagonal.

Cluster A10 includes three information-theoretic indices
IC and SIC of low order (0 and 1) as well as two triplet
indices having in common the two unit vectors (distance sum
S for the main diagonal, vertex degree V for the free term)
and the operation for assembling LOVIs into an index
(summation of LOVI square roots).

Interestingly, the four smallest clusters having four
descriptors each are pairwise similar in type: A11 with A13,
and A12 with A14. Cluster A11 consists of information TIs
(IC, SIC, CIC) of low order (0-2), whereas A13 includes
the same TIs of slightly higher order (2 and 3). Clusters A12
and A14 group together molecular connectivity indices based
on simple cluster and simple cycle, respectively.

A general remark for the triplet indices is that what groups
them together is not the matrix on which they are based
(adjacency matrix or distance matrix) but the two unit vectors
that convert such matrices into systems of linear equations.

Table 4. Summary of Variable Clustering for 1037 Diverse
Chemicals

cluster  number of variables  representative variables (max. 25% of total listed)
B1  49  P0, ANNj, ANNs, AN13, ANN,, ANV„, AS14, DN214
B2  13  ANV,, P3, m2
B3  13  AS11, AS15, DS1i
B4  13  y, y, P7
B5  11  ASN5, AS13, ASN,
B6  10  SIC3, SIC4, CIC4
B7  9  y pc, ye
B8  8  ASZ2, ASZ,
B9  6  Yc, sXc
B10  6  3ta, yCh
B11  6  IC4, IC5
B12  6  CIC,, SIC,
B13  6  6Xvch, Yen
B14  6  Vc, 4*c
B15  4  7B
B16  4  AS12
B17  4  DN2N,
B18  2  ANS,

Clustering of Descriptors for the Diverse Set of
Compounds. There are 18 variable clusters grouping together
176 variables for the database of 1037 diverse
compounds (Table 4). Cluster B1, with 49 descriptors,
includes 28% of all variables; 35 of these descriptors are
common to cluster A1. Some of these indices, e.g., W
(Wiener number), P0 (number of non-hydrogen atoms), and
P1 (number of bonds in the hydrogen-depleted graph),
express molecular size. It is interesting that most of the triplet
variables ($AZV_i$, $AZN_i$, and $ANN_i$ with i = 1-5 as well as
several other ones) are found to be common to clusters A1
and B1. Five other descriptors (V, 2%b, y, y, and y)
also appear in both clusters A1 and B1.

Cluster  B2  has  nine  variables  in  common  with  cluster  Al; 
most  of  these  (3^,  *%,  Pi-Pa)  are  path  connectivities  of 
intermediate  order.  A  couple  of  triplet  indices  (ANVi  and 
ANV5)  are  also  in  common  with  cluster  A 1;  another  pair  of 
triplet  indices  (ASN3  and  ASN4)  are  in  common  with  cluster 
Al. 

Cluster  B3  contains  triplet  indices  with  distance  sums  as 
main  diagonal  vector;  they  occur  in  clusters  A5  and  A9.  In 
addition,  two  descriptors  (1^  and  HD)  appear  also  in  cluster 
Al. 

Cluster  B4  is  uniquely  associated  with  cluster  A2  and 
consists  of  indices  *%t y  y,  V,  y,  V,  and  P6-P 10.  These 
descriptors  are  based  on  long  paths;  therefore,  these  variables 
appear  only  when  large  molecules  are  involved. 

Seven of the eleven variables of cluster B5 form exclusively
cluster A6; they are related to molecular shape via
vertex complexity and graph radius. Five triplet indices such
as ASN1, ASN5, DSN1, DSN5, and ANV2 also are common
to these two clusters.

Very interesting correspondences are manifested by cluster
B6, which is mainly associated with two clusters involving
the hydrocarbon database, namely, A4 and A13 (plus one
descriptor in B6 that appears in A10). All variables are of
information-theoretic type. These higher-order variables
(SIC3-SIC6 and CIC3-CIC6) are common to clusters B6
and A4 and represent a true measure of molecular complexity.
The lower- and intermediate-order indices such as IC1
or SIC2 that appear in clusters B6 and A10 or B6 and A13,
respectively, provide information on lower-order complexity
that may be more degenerate than that furnished by the
higher-order information indices. One should stress here that
information content indices form clusters that are separate
from clusters with other descriptors, meaning that such
variables convey unique information relative to structure and
molecular complexity.

Cluster B7 consists only of path-cluster molecular connectivity
descriptors that were included in clusters A3, A7,
and A8 for the hydrocarbons.

Cluster B8 includes triplet indices, all of which have the
atomic number Z for the free-term vector in the system of
linear equations. Most of these descriptors appear in clusters
A1, A5, and A9.

Cluster B9 consists of high-order connectivity-cluster terms
all contained in cluster A8. For hydrocarbons, descriptors
$^6\chi^b_C$ and $^6\chi^v_C$ are perfectly correlated with descriptor $^6\chi_C$;
therefore, the former variables did not appear in the
hydrocarbon cluster A8. For the diverse-compound database,
such a correlation is not perfect because of differences in
atom types.

An interesting observation concerns cluster B10: all six
variables are absent from the hydrocarbon database because
the database does not contain any three- or four-membered
rings, unlike the diverse compound database. This is why
indices $^{3,4}\chi_{Ch}$, $^{3,4}\chi^b_{Ch}$, and $^{3,4}\chi^v_{Ch}$ appear only in cluster B10.

Cluster B11 has all but one of its descriptors contained in
cluster A4; these information content indices, IC2-IC6,
measure a high degree of nonredundancy of topological
neighborhoods.

Cluster B12 has four of its variables contained in cluster
A11; these descriptors (SIC0, CIC0-CIC2) express lower-order
redundancy of topological neighborhoods. This is true
of indices IC0 and SIC1 as well, which are present in cluster
A10.

From cluster B13, the six descriptors (simple, bond- and
valence-corrected chain molecular connectivity indices) are
partitioned equally between clusters A2 and A14, according
to the six- versus five-membered ring size, respectively; in
the hydrocarbon database A, six-membered chains (or rings)
predominate.

Cluster B14 is exclusively associated in a one-to-one
relationship with cluster A12. The corresponding descriptors
$^3\chi_C$ and $^4\chi_C$ as well as their bond- and valence-corrected
counterparts represent connectivity indices on three- and
four-vertex structural clusters. For the hydrocarbon database, we
have again a case in which the two indices $^4\chi^b_C$ and $^4\chi^v_C$,
perfectly correlated with $^4\chi_C$, do not appear explicitly in
cluster A12.

Half of the variables (J-type indices) in cluster B15 are
contained in cluster A2. These J indices again form a cluster
apart from all other ones in the case of the diverse database,
proving that when heteroatoms are taken into account, the
information provided by such J-type indices is unique.

Clusters B16, B17, and B18 each have a small number of
triplet-type descriptors; the three descriptors of cluster B17
are all contained in cluster A7.

Intercluster Correlations. From each cluster we select
15-25% of the descriptors according to the maximal value
of the correlation coefficient with their own cluster. In most
cases, the first selected descriptor also has the minimal value
of the correlation with the next closest cluster, expressed by
the $1 - r^2$ value. When more than one index is chosen from
the same cluster, after the first one was selected as indicated
above, the next one must also fulfill a third criterion, namely,
a low intercorrelation with the previously selected indices
of the same cluster.

Figure 2. Graph of highly correlated topological indices (TIs)
according to Todeschini et al. (notation of TIs as in Table 3 of ref
31). Lines connect TIs with r > 0.90.

There  were  four  intercluster  correlations  within  the 
hydrocarbon  data  set  that  were  greater  than  0.9,  and  all 
involved  cluster  Al.  Cluster  Al  was  positively  correlated 
with  A2,  A3,  and  Al.  Cluster  Al  was  correlated  negatively 
with  A5.  Each  of  the  clusters  characterizes  some  aspect  of 
molecular  size  and  shape. 

Cluster B1 showed an intercluster correlation of 0.92 with
cluster B2 and -0.90 with cluster B3. These were the only
intercluster correlations of magnitude 0.9 or greater. These clusters are
the three largest clusters in set B. Like cluster A1, cluster
B1 groups TIs expressing molecular size and shape. Interestingly,
in set A cluster A1 also had a negative intercluster
correlation with cluster A5; it is therefore not surprising that
clusters A5 and B3 have the most abundantly populated line
connecting them in Figure 1.

In  summary,  for  the  hydrocarbon  database  there  are  four 
intercluster  correlations  with  r  >  0.90  all  involving  on  one 
hand  the  first  cluster  Al  and  on  the  other  hand  clusters  A2, 
A3,  A5,  and  Al.  For  the  diverse  compound  database  there 
are  only  two  such  intercluster  correlations  with  r  >  0.90, 
namely,  B1  with  B2  and  B3.  This  is  not  unexpected,  as  the 
combination  of  the  first  three  clusters  in  each  case  contains 
more  descriptors  than  the  parameters  remaining  in  all  the 
remaining  ones  together. 

In this context, one should mention that Todeschini and
co-workers published an interesting study35 on 23 TIs for a
set of 667 diverse chemicals, 20% of which were hydrocarbons;
the above authors excluded 10 of these TIs because
they were degenerate, or redundant or had intercorrelation
factors higher than 0.90. A graph depicting highly intercorrelated
indices using data published by these authors is
presented in Figure 2, which is similar to a graph published
earlier.22

Ten TIs were then selected by Todeschini et al.,35 namely,
the molecular weight (MW), J, IC, CIC, the bonding information
content (BIC), mean Randic connectivity ($\chi$), the
information content on atomic composition ($I_{AC}$), the mean
Wiener index (W), and the mean information indices on
equality of distance degree and on the magnitude of distance
degree ($I_{ED,deg}$ and $I_{WD,deg}$, respectively). Then, using
principal-component analysis for the above 10 TIs, Todeschini et al.
analyzed the composition of the first six principal components.
They found that the first PC is mainly composed of
indices that express the size of molecules (MW, W, IC, $I_{ED,deg}$,
and $I_{WD,deg}$). This is in agreement with the earlier finding of
Basak et al. for a set of 3692 diverse chemicals that the first
PC is related to molecular size.29 Further, Todeschini et al.
found that the second PC is dominated by indices expressing
information on bonds (IC, CIC, and BIC). This is also
analogous to the results reported by Basak et al.29 that the
second axis represents molecular complexity as encoded by
higher-order neighborhood complexity indices (IC2, IC3,
SIC2, SIC3, CIC2, CIC3, etc.). The IC, CIC, and BIC indices
used by Todeschini et al. are based solely on first-order
topological bonding/neighborhoods and slightly different
equivalence relations as compared to the ICr, SICr, and CICr
indices defined by Roy et al.27 In studies by Basak et al.,29
the first-order complexity indices (IC1, SIC1, CIC1) were
usually most correlated with the first PC. Each of the next
four PCs in Todeschini et al.’s study35 is dominated by a
single TI, viz., $I_{AC}$, J (indicating branching), and $I_{ED,deg}$
(connected with the position of substituents on the molecular
scaffold), respectively.

Abstract. Mathematical invariants are frequently used for the characterization
of molecular graphs. Such invariants quantify structural features of chemicals
like size, shape, symmetry, cyclicity, complexity, branching, etc. Numerical
graph invariants or topological indices (TIs) have been used in developing
quantitative structure-property/activity/toxicity relationship models and in
defining intermolecular similarity. In this paper, we have used a set of TIs and
a class of substructures called atom pairs (APs) in selecting analogs of probe
chemicals from a set of mutagens. The result shows that both of the similarity
methods select analogs which have reasonable structural similarity with the
query chemicals. Such analogs, selected computationally, can be useful in the
hazard assessment of chemicals for which very little or no toxicity data are
available.


1.  Introduction 

A contemporary interest in mathematical chemistry is the characterization of
molecular structure using graph theoretic formalism [1]-[11]. A graph G = [V, E]
consists of an ordered pair of two sets V and E, representing the vertices and edges,
respectively. G becomes a molecular graph when the set V represents the set of
atoms in a molecule and the set E symbolizes chemical bonds between adjacent
atoms [8].

Mathematical characterization of molecular graphs (structures) may be accomplished
using graph invariants. An invariant may be a polynomial, a sequence of
numbers, or a real number. A real number characterizing a molecular graph is called
a topological index (TI). TIs quantify different aspects of molecular architecture,
viz., size, shape, cyclicity, branching, symmetry, etc. [8].

TIs have been used extensively in quantitative structure-property/activity
relationships (QSPR and QSAR, respectively) and the quantification of intermolecular
similarity/dissimilarity of chemicals [10]-[24]. In quantitative molecular similarity
analysis (QMSA) studies, TIs have been used to derive high dimensional structure
spaces where the Euclidean distance $D_{ij}$ between a pair of molecules i and j is used
to quantify the similarity between them. Similarity measures can be used either for
the selection of analogs of chemicals or in the prediction of the property/activity
of a molecule from the property of its selected neighbor(s).
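Both uses of the Euclidean distance mentioned above can be sketched briefly. The example is illustrative only: the compound names, the two-dimensional descriptor vectors, the activity labels, and the choice of averaging over the k nearest neighbors are assumptions made for the sketch, not data or settings from this study.

```python
import math

def nearest_analogs(query, library, k=3):
    """Rank library compounds by Euclidean distance to the query in descriptor space."""
    ranked = sorted(library, key=lambda name: math.dist(query, library[name]))
    return ranked[:k]

def knn_estimate(query, library, activity, k=3):
    """Estimate a property as the mean property value of the k nearest analogs."""
    analogs = nearest_analogs(query, library, k)
    return sum(activity[name] for name in analogs) / len(analogs)

# Hypothetical descriptor vectors (e.g., two principal components) and activities.
library = {
    "cmpd_a": (0.0, 0.0),
    "cmpd_b": (1.0, 0.0),
    "cmpd_c": (0.0, 2.0),
    "cmpd_d": (5.0, 5.0),
}
activity = {"cmpd_a": 1, "cmpd_b": 1, "cmpd_c": 0, "cmpd_d": 0}
query = (0.2, 0.1)

analogs = nearest_analogs(query, library, k=2)
estimate = knn_estimate(query, library, activity, k=2)
```

A smaller distance corresponds to a higher similarity, so the first entries of the ranked list are the selected analogs.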

In some of our recent QSAR/QMSA studies we have used different similarity
measures derived from TIs in the selection of analogs and prediction of
properties/activities for diverse sets of chemicals. We have also used orthogonal
descriptors derived from a set of over 100 graph invariants to estimate bioactivity/toxicity
of different graphs of molecules. In this paper we have used similarity measures
derived from TIs in: a) selecting analogs of an isospectral graph from a diverse set
of 221 compounds, and b) predicting the mutagenicity of a set of 113 mutagens and
non-mutagens using QMSA methods.

2.  Methods 

2.1. Databases. A set of 19 pairs of isospectral graphs from the work of
Balasubramanian and Basak [25] was added to a set of 107 benzamidines [26] and
a composite set of 76 diverse compounds used in an earlier study by Basak and
Grunwald [23] to create a varied library of 221 compounds. This composite library
was created to provide a large set containing both congeneric and non-congeneric
sets to test analog selection methods. The chemical structures for the 19 pairs of
isospectral graphs have been presented previously [25].

A second data set, representing a subset of the set of 277 chemicals presented by
Yamaguchi et al. [27], was also used in the current study. This subset consisted of all
the chemicals in the set of 277 chemicals that had reported results for mutagenicity
in the Ames test, mutagenicity in the medium term liver carcinogenesis bioassay,
and carcinogenicity in the two-year rodent bioassay in rat and/or mouse. This
subsetting resulted in a set of 113 chemicals, 68 of which are classified as
non-mutagens and 45 of which are classified as mutagens in the Ames test. This set of
chemicals and their observed mutagenicity are reported in Table 1.

Table 1: Mutagenicity in the Ames test for 113 chemicals

No.^a  Compound Name  Obs. Ames Mutagenicity
1.5  butylated hydroxyanisole (BHA)  0
1.6  caffeic acid  0
1.7  catechol  0
1.8  clofibrate  0
1.9  di(2-ethylhexyl)phthalate (DEHP)  0
1.10  hydroquinone  0
1.11  p-methoxyphenol  0
1.12  sesamol  0
1.13  tamoxifen  0
1.14  acetaminophen  0
1.15  benzoin  0
1.16  EPN  0
1.17  gallic acid  0
1.18  α-tocopherol  0
2.2  2-acetylaminofluorene (AAF)  1
2.3  adriamycin  1
2.4  aflatoxin B1  1
2.5  benzo[a]pyrene  1
2.7  captafol  1
2.8  captan  1
2.9  carbazole  1
2.10  dibutylnitrosamine (DBN)  1
2.11  diethylnitrosamine (DEN)  1
2.12  3,2'-dimethyl-4-aminobiphenyl (DMAB)  1
2.14  dimethylnitrosamine (DMN)  1
2.15  N-ethyl-N-hydroxyethylnitrosamine (EHEN)  1
2.16  N-ethyl-N-nitrosourea (ENU)  1
2.20  hydrazobenzene  1
2.22  lasiocarpine  1
2.26  3'-methyl-4-dimethylaminoazobenzene (3'-Me-DAB)  1
2.27  3-amino-9-ethylcarbazole  1
2.28  N-nitrosooxazolidine  1
2.29  N-nitrosodi-n-propylamine (NDPA)  1
2.30  N-nitrosomorpholine  1
2.31  N-nitrosopiperidine  1
2.32  N-nitrosopyrrolidine  1
2.33  quinoline  1
2.34  sterigmatocystin  1
2.35  4,4'-thiodianiline  1
2.42  alachlor  0
2.43  aldrin  0
2.44  auramine O  0
2.45  barbital  0
2.46  chlordane  0
2.47  chlorendic acid  0
2.48  chlorobenzilate  0
2.49  DDT  0
2.50  dieldrin  0
2.51  diethylstilbestrol  0
2.53  ethenzamide  0
2.54  17α-ethinyl estradiol  0
2.55  DL-ethionine  0
2.56  hexachlorobenzene (HCB)  0
2.57  α-hexachlorocyclohexane (α-HCH)  0
2.58  d-limonene  0
2.59  monocrotaline  0
2.60  N-nitrosodiethanolamine  0
2.61  phenobarbital  0
2.64  safrole  0
2.66  thioacetamide  0
2.67  triadimefon  0
2.68  trifluralin  0
2.69  urethane  0
2.70  polychlorinated biphenyl (PCB)  0
2.71  malathion  0
2.72  vinclozolin  0
3.1  acetophenetidine (phenacetin)  1
3.2  azathioprine  1
3.3  N-butyl-N-(4-hydroxybutyl)nitrosamine (BBN)  1
3.4  chrysazin (danthron)  1
3.5  4,4'-diaminodiphenylmethane (DDPM)  1
3.6  7,12-dimethylbenz[a]anthracene (DMBA)  1
3.7  N-ethyl-N-(4-hydroxybutyl)nitrosamine (EHBN)  1
3.8  folpet  1
3.9  hydrogen peroxide  1
3.11  3-methylcholanthrene (3-MC)  1
3.12  N-methyl-N'-nitro-N-nitrosoguanidine (MNNG)  1
3.13  N-methyl-N-nitrosourea (MNU)  1
3.14  8-nitroquinoline  1
3.17  streptozotocin  1
3.18  o-toluidine  1
3.20  6-methylquinoline  1
3.21  8-methylquinoline  1
3.22  nitrofurantoin  1
3.23  6-nitroquinoline  1
3.24  quercetin  1
3.32  acetaldehyde  0
3.33  atrazine  0
3.34  di(2-ethylhexyl)adipate (DEHA)  0
3.35  1,1-dimethylhydrazine  0
3.39  trichloroacetic acid  0
3.42  4-acetylaminofluorene (AAF)  0
3.43  aspirin  0
3.44  butylated hydroxytoluene (BHT)  0
3.45  caffeine  0
3.46  caprolactam  0
3.47  chenodeoxycholic acid  0
3.49  cypermethrin  0
3.50  deltamethrin  0
3.51  diltiazem  0
3.52  dimethylsulfoxide (DMSO)  0
3.53  diazinon  0
3.54  fenvalerate  0
3.55  glutathione  0
3.56  4-o-hexyl-2,3,6-trimethylhydroquinone (HTHQ)  0
3.58  lithocholic acid  0
3.59  d-mannitol  0
3.61  phenol  0
3.64  propyl gallate  0
3.65  propylparaben  0
3.66  pyrene  0
3.67  resorcinol  0
3.71  trimorphamide  0

^a The numbering scheme refers to the enumeration of the chemicals
in the presentation by Yamaguchi et al. [27], where the numeral before
the decimal point refers to the table in which the compound was
listed (see below) and the numerals after the decimal point refer to the
compound's location within the table.

Table 1 - Association between inhibitory results in the medium-term
liver bioassay (Ito test) and reported mutagenicity and carcinogenicity.

Table 2 - Association between positive results in the medium-term
liver bioassay (Ito test) and reported mutagenicity and carcinogenicity.

Table 3 - Association between negative results in the medium-term
liver bioassay (Ito test) and reported mutagenicity and carcinogenicity.


2.2. Calculation of Topological Indices. The TIs calculated for this study
are listed in Table 2 and include Wiener number [28], molecular connectivity indices
as calculated by Randic [29] and Kier and Hall [4], frequency of path lengths
of varying size, information theoretic indices defined on distance matrices of graphs
using the methods of Bonchev and Trinajstic [30] as well as those of Raychaudhury
et al. [31], parameters defined on the neighborhood complexity of vertices in
hydrogen-filled molecular graphs [32]-[34], and Balaban’s J indices [35]-[37]. The
majority of the TIs were calculated using POLLY 2.3 [38]. The J indices were
calculated using software developed by the authors.

The Wiener index (W) [28], the first topological index reported in the chemical literature, may be calculated from the distance matrix D(G) of a hydrogen-suppressed chemical graph G as the sum of the entries in the upper triangular distance submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a symmetric n x n matrix (d_ij), where d_ij is equal to the distance between vertices v_i and v_j in G. Each diagonal element d_ii of D(G) is zero. We give below the distance matrix D(G_1) of the unlabeled hydrogen-suppressed graph G_1 of thioacetamide (Fig. 1):

D(G_1) = \begin{array}{c|cccc}
  & 1 & 2 & 3 & 4 \\ \hline
1 & 0 & 1 & 2 & 2 \\
2 & 1 & 0 & 1 & 1 \\
3 & 2 & 1 & 0 & 2 \\
4 & 2 & 1 & 2 & 0
\end{array}

W is calculated as:

(2.1)  W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h

where g_h is the number of unordered pairs of vertices whose distance is h. Thus for D(G_1), W has a value of nine.
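As a check on Eq. (2.1), the following minimal Python sketch recomputes W for the thioacetamide distance matrix given above, both as the upper-triangular sum and as the sum of h times g_h. The function names are illustrative and are not part of POLLY or any other software cited here.

```python
from itertools import combinations

# Distance matrix D(G_1) of hydrogen-suppressed thioacetamide, as in the text
# (row/column 2 is the central carbon, stored here at index 1).
D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

def wiener_index(dist):
    """Sum of the entries above the diagonal of a symmetric distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i, j in combinations(range(n), 2))

def wiener_from_counts(dist):
    """Equivalent form: sum over distances h of h * g_h."""
    n = len(dist)
    g = {}  # g[h] = number of unordered vertex pairs at distance h
    for i, j in combinations(range(n), 2):
        g[dist[i][j]] = g.get(dist[i][j], 0) + 1
    return sum(h * count for h, count in g.items())

W = wiener_index(D)
print(W, wiener_from_counts(D))
```

Here g_1 = 3 and g_2 = 3, so both forms give 1*3 + 2*3 = 9, in agreement with the value stated in the text.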


Figure 1. Unlabeled, hydrogen-suppressed graph of thioacetamide (G_1). [Structure drawing: H3C-C(=S)-NH2]

Randić's connectivity index [29] and the higher-order path, cluster, path-cluster and chain types of simple, bond and valence connectivity parameters were calculated using the method of Kier and Hall [4]. The generalized form of the simple path connectivity index is as follows:

(2.2)  ^h\chi = \sum_{\text{paths}} (v_i v_j \cdots v_{h+1})^{-1/2}

where v_i, v_j, ..., v_{h+1} are the degrees of the vertices in the path of length h. The path length parameters (P_h), the number of paths of length h (h = 0, 1, ..., 10) in the hydrogen-suppressed graph, are calculated using standard algorithms.
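The simple path connectivity index and the path counts P_h can be illustrated with a short Python sketch. This is only one straightforward way to enumerate paths (depth-first search over an adjacency list), not the algorithm used by POLLY; the adjacency dictionary for hydrogen-suppressed thioacetamide and the function names are assumptions made for the example.

```python
import math

# Hydrogen-suppressed thioacetamide: vertex 2 (carbon) is bonded to 1, 3 and 4.
ADJ = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

def paths_of_length(adj, h):
    """Return every simple path with h edges (h + 1 vertices), listed once."""
    found = []

    def extend(path):
        if len(path) == h + 1:
            # Each path is reached from both ends; keep one orientation only.
            if h == 0 or path[0] < path[-1]:
                found.append(tuple(path))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return found

def path_connectivity(adj, h):
    """Sum over paths of (product of vertex degrees) ** -1/2, as in Eq. (2.2)."""
    total = 0.0
    for path in paths_of_length(adj, h):
        degree_product = math.prod(len(adj[v]) for v in path)
        total += degree_product ** -0.5
    return total

P1 = len(paths_of_length(ADJ, 1))   # number of paths of length 1 (bonds)
chi1 = path_connectivity(ADJ, 1)    # first-order path connectivity index
```

For this graph each of the three bonds joins a vertex of degree 1 to the vertex of degree 3, so the first-order index is 3 * (1*3)^(-1/2), i.e. the square root of 3.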

Information-theoretic topological indices are calculated by the application of information theory to chemical graphs. An appropriate set A of n elements is derived from a molecular graph G depending upon certain structural characteristics. On the basis of an equivalence relation defined on A, the set A is partitioned into h disjoint subsets A_i of order n_i (i = 1, 2, ..., h; \sum_{i=1}^{h} n_i = n). A probability distribution is then assigned to the set of equivalence classes:

A_1, A_2, ..., A_h
p_1, p_2, ..., p_h

where p_i = n_i/n is the probability that a randomly selected element of A will occur in the ith subset.

The mean information content of an element of A is defined by Shannon's relation [39]:

(2.3)  IC = -\sum_{i=1}^{h} p_i \log_2 p_i

The logarithm is taken at base 2 for measuring the information content in bits. The total information content of the set A is then n x IC. Figure 2 provides a sample calculation for IC_1.

Table 2: Symbols and brief definitions for 101 topological indices

I_D^W          Information index for the magnitudes of distances between all possible pairs of vertices of a graph
\bar{I}_D^W    Mean information index for the magnitude of distance
W              Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D            Degree complexity
H^V            Graph vertex complexity
H^D            Graph distance complexity
IC             Information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB          Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O              Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1            A Zagreb group parameter = sum of square of degree over all vertices
M_2            A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r           Mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r          Structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC_r          Complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
^h\chi         Path connectivity index of order h = 0-6
^h\chi_C       Cluster connectivity index of order h = 3-6
^h\chi_{Ch}    Chain connectivity index of order h = 3-6
^h\chi_{PC}    Path-cluster connectivity index of order h = 4-6
^h\chi^b       Bond path connectivity index of order h = 0-6
^h\chi^b_C     Bond cluster connectivity index of order h = 3-6
^h\chi^b_{Ch}  Bond chain connectivity index of order h = 3-6
^h\chi^b_{PC}  Bond path-cluster connectivity index of order h = 4-6
^h\chi^v       Valence path connectivity index of order h = 0-6
^h\chi^v_C     Valence cluster connectivity index of order h = 3-6
^h\chi^v_{Ch}  Valence chain connectivity index of order h = 3-6
^h\chi^v_{PC}  Valence path-cluster connectivity index of order h = 4-6
P_h            Number of paths of length h = 0-10
J              Balaban's J index based on distance
J^B            Balaban's J index based on bond types
J^X            Balaban's J index based on relative electronegativities
J^Y            Balaban's J index based on relative covalent radii

It is to be noted that the information content of a graph G is not uniquely defined. It depends on how the set A is derived from G as well as on the equivalence relation which partitions A into disjoint subsets A_i. For example, when A constitutes the vertex set of a chemical graph G, two methods of partitioning have been widely used: a) chromatic-number coloring of G, where two vertices of the same color are considered equivalent, and b) determination of the orbits of the automorphism group of G, where vertices belonging to the same orbit are considered equivalent.

Rashevsky was the first to calculate the information content of graphs where "topologically equivalent" vertices were placed in the same equivalence class [40]. In Rashevsky's approach, two vertices u and v of a graph are said to be topologically equivalent if and only if for each neighboring vertex u_i (i = 1, 2, ..., k) of the vertex u, there is a distinct neighboring vertex v_i of the same degree for the vertex v. While Rashevsky used simple linear graphs with indistinguishable vertices to symbolize molecular structure, weighted linear graphs or multigraphs are better models for conjugated or aromatic molecules because they more properly reflect the actual bonding patterns, i.e., electron distribution.

To account for the chemical nature of vertices as well as their bonding pattern, Sarkar et al. [41] calculated the information content of chemical graphs on the basis of an equivalence relation where two atoms of the same element are considered equivalent if they possess an identical first-order topological neighborhood. Since properties of atoms or reaction centers are often modulated by stereo-electronic characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed essential to extend this approach to account for higher-order neighbors of vertices. This can be accomplished by defining open spheres for all vertices of a chemical graph. If r is any non-negative real number and v is a vertex of the graph G, then the open sphere S(v, r) is defined as the set consisting of all vertices v_i in G such that d(v, v_i) < r. Therefore, S(v, 0) = ∅, S(v, r) = {v} for 0 < r ≤ 1, and S(v, r) is the set consisting of v and all vertices v_i of G situated at unit distance from v, if 1 < r ≤ 2.
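The open-sphere definition can be made concrete with a small Python sketch that reads S(v, r) directly off a distance matrix. The matrix is the thioacetamide example from Section 2.2 (with 0-based indices, so index 1 is the central carbon); the function name is illustrative only.

```python
# Distance matrix of hydrogen-suppressed thioacetamide (0-based indices).
D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

def open_sphere(dist, v, r):
    """Set of vertices u with d(v, u) strictly less than r."""
    return {u for u, d in enumerate(dist[v]) if d < r}

empty = open_sphere(D, 1, 0)        # no vertex has distance < 0
only_v = open_sphere(D, 1, 0.5)     # just the vertex itself
first_order = open_sphere(D, 1, 2)  # v plus its bonded neighbors
```

Because graph distances are integers, every r in a given unit interval yields the same sphere, which is why only integral orders of neighborhood need to be considered in practice.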

One can construct such open spheres for higher integral values of r. For a particular value of r, the collection of all such open spheres S(v, r), where v runs over the whole vertex set V, forms a neighborhood system of the vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets consisting of vertices which are topologically equivalent for rth order neighborhood. Such an approach has been developed, and the information-theoretic indices calculated based on this idea are called indices of neighborhood symmetry [34].

In this method, chemicals are symbolized by weighted linear graphs. Two vertices u_0 and v_0 of a molecular graph are said to be equivalent with respect to rth order neighborhood if and only if corresponding to each path u_0, u_1, ..., u_r of length r, there is a distinct path v_0, v_1, ..., v_r of the same length such that the paths have similar edge weights, and both u_0 and v_0 are connected to the same number and type of atoms up to the rth order bonded neighbors. The detailed equivalence relation has been described in earlier studies [34, 42].

Subsets:      I        II       III  IV   V    VI
              (H1-H3)  (H4-H5)  C6   C7   N8   S9
Probability:  3/9      2/9      1/9  1/9  1/9  1/9

IC_1 = 4 * 1/9 * log2 9 + 2/9 * log2 9/2 + 3/9 * log2 9/3 = 2.419 bits
SIC_1 = IC_1 / log2 9 = 0.763 bits
CIC_1 = log2 9 - IC_1 = 0.751 bits

Figure 2. Labeled, hydrogen-filled graph of thioacetamide (G_2) and sample calculations for IC_1, SIC_1 and CIC_1

Once partitioning of the vertex set for a particular order of neighborhood is completed, IC_r is calculated by Eq. (2.3). Basak et al. [32] defined another information-theoretic measure, structural information content (SIC_r), which is calculated as:

(2.4)  SIC_r = IC_r / \log_2 n

where IC_r is calculated from Eq. (2.3) and n is the total number of vertices of the graph.

Another information-theoretic invariant, complementary information content (CIC_r) [43], is defined as:

(2.5)  CIC_r = \log_2 n - IC_r

CIC_r represents the difference between the maximum possible complexity of a graph (where each vertex belongs to a separate equivalence class) and the realized topological information of a chemical species as defined by IC_r. Sample calculations for SIC_1 and CIC_1 have been included in Figure 2.
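The sample calculation in Figure 2 can be reproduced with a few lines of Python. The sketch below takes only the sizes of the equivalence classes as input (for thioacetamide at first order: 3, 2, 1, 1, 1, 1 out of nine atoms) and applies Shannon's relation together with the definitions of SIC_r and CIC_r; the function name is an assumption made for illustration.

```python
import math

def information_indices(class_sizes):
    """Return (IC, SIC, CIC) for a partition given by its class sizes."""
    n = sum(class_sizes)
    # Shannon's relation with p_i = n_i / n, logarithm to base 2 (bits).
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)   # structural information content
    cic = math.log2(n) - ic   # complementary information content
    return ic, sic, cic

# First-order neighborhood classes of hydrogen-filled thioacetamide (Figure 2):
# (H1-H3), (H4-H5), C6, C7, N8, S9
ic1, sic1, cic1 = information_indices([3, 2, 1, 1, 1, 1])
print(round(ic1, 3), round(sic1, 3), round(cic1, 3))
```

The rounded results are 2.419, 0.763 and 0.751, matching the values reported in Figure 2.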

The information-theoretic index on graph distance, I_D^W, is calculated from the distance matrix D(G) of a chemical graph G as follows [30]:

(2.6)  I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h

The mean information index, \bar{I}_D^W, is found by dividing the information index I_D^W by W. The information-theoretic parameters defined on the distance matrix, H^D and H^V, were calculated by the method of Raychaudhury et al. [31].
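As an illustration of the distance-based information index, the Python sketch below evaluates W log2 W minus the sum over h of g_h h log2 h, and the corresponding mean index, for the thioacetamide distance matrix of Section 2.2. It assumes the Bonchev-Trinajstić form in which g_h counts the unordered vertex pairs at distance h; the function name is illustrative.

```python
import math
from collections import Counter
from itertools import combinations

D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

def distance_information(dist):
    """Return (W, I_D^W, mean I_D^W) for a symmetric distance matrix."""
    n = len(dist)
    # g[h] = number of unordered vertex pairs separated by distance h
    g = Counter(dist[i][j] for i, j in combinations(range(n), 2))
    w = sum(h * count for h, count in g.items())
    i_dw = w * math.log2(w) - sum(count * h * math.log2(h) for h, count in g.items())
    return w, i_dw, i_dw / w

W, I_DW, I_DW_mean = distance_information(D)
```

For thioacetamide, g_1 = 3 and g_2 = 3, so the subtracted term is 3*1*0 + 3*2*1 = 6 and the index equals 9 log2 9 - 6.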

Balaban defined a series of indices based upon distance sums within the distance matrix for a chemical graph that he designated as J indices [35]-[37]. These indices are highly discriminating with low degeneracy. Unlike W, the range of values of the J indices is independent of molecular size. The general form of the J index calculation is as follows:

(2.7)  J = q (\mu + 1)^{-1} \sum_{i,j \in \text{edges}} (s_i s_j)^{-1/2}

where the cyclomatic number \mu (or number of rings in the graph) is \mu = q - n + 1, with q edges and n vertices, s_i is the sum of the distances of atom i to all other atoms, and s_j is the sum of the distances of atom j to all other atoms [35]. Variants were proposed by Balaban for incorporating information on bond type, relative electronegativities, and relative covalent radii [36, 37].
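A minimal Python sketch of the distance-based J index follows, using the thioacetamide graph from Section 2.2 as input. It covers only the basic J of Eq. (2.7), not the bond-type, electronegativity or covalent-radius variants, and the function name and 0-based edge list are assumptions for the example rather than the authors' software.

```python
import math

D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]
EDGES = [(0, 1), (1, 2), (1, 3)]  # bonds of hydrogen-suppressed thioacetamide

def balaban_j(dist, edges):
    """J = q / (mu + 1) * sum over edges (i, j) of (s_i * s_j) ** -1/2."""
    n = len(dist)
    q = len(edges)
    mu = q - n + 1                      # cyclomatic number (0 for an acyclic graph)
    s = [sum(row) for row in dist]      # distance sum of each vertex
    return q / (mu + 1) * sum((s[i] * s[j]) ** -0.5 for i, j in edges)

J = balaban_j(D, EDGES)
```

Here the distance sums are 5, 3, 5 and 5, each edge contributes (3*5)^(-1/2), and with q = 3 and mu = 0 the result is 9 divided by the square root of 15, about 2.32.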

2.3. Calculation of Atom Pairs. Atom pairs (APs) were calculated using the method of Carhart et al. [3]. An atom pair is defined as a substructure consisting of two non-hydrogen atoms i and j and their interatomic separation:

< atom descriptor_i > - < separation > - < atom descriptor_j >

where < atom descriptor > contains information about the atomic type, number of non-hydrogen neighbors and the number of π electrons. The interatomic separation of two atoms is the number of atoms traversed in the shortest bond-by-bond path containing both atoms. APs used in this study were calculated by the APProbe software [43].

2.4. Statistical Methods and Computation of Intermolecular Similarity.

2.4.1. Data Reduction. Initially, all TIs were transformed by the natural logarithm of the index plus one. This was done since the scale of some TIs may be several orders of magnitude greater than that of other TIs.

A principal component analysis (PCA) was used on the transformed indices to minimize the intercorrelation of indices. The PCA was conducted using the SAS procedure PRINCOMP [44]. The PCA produces linear combinations of the TIs, called principal components (PCs), which are derived from the correlation matrix. The first PC has the largest variance, or eigenvalue, of the linear combinations of TIs. Each subsequent PC explains the maximal index variance orthogonal to the previous PCs, eliminating any redundancies that could occur within the set of TIs. The maximum number of PCs generated is equal to the number of TIs available. For the purposes of this study, only PCs with eigenvalues greater than one were retained. A more detailed explanation of this approach has been provided in a previous study by Basak et al. [13]. These PCs were subsequently used to determine similarity scores as described below.
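The data-reduction steps described above (log transform, PCA on the correlation matrix, retention of components with eigenvalues greater than one) can be sketched in Python with NumPy. This is not the SAS PRINCOMP procedure used in the study, and details such as score scaling may differ; the function name and the synthetic demonstration data are assumptions made purely for illustration.

```python
import numpy as np

def pca_retain_large_eigenvalues(X, threshold=1.0):
    """Log-transform, standardize, and keep PCs whose eigenvalue exceeds threshold."""
    Z = np.log(np.asarray(X, dtype=float) + 1.0)      # ln(index + 1)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardize each column
    R = (Z.T @ Z) / (Z.shape[0] - 1)                   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > threshold
    scores = Z @ eigvecs[:, keep]                      # PC scores of retained PCs
    return eigvals, scores

# Synthetic, made-up "indices" on very different scales (not data from the study).
rng = np.random.default_rng(0)
base = rng.random((30, 2))
X_demo = np.column_stack([
    1000.0 * base[:, 0],
    1000.0 * base[:, 0] + rng.random(30),
    base[:, 1],
    5.0 * base[:, 1] + 0.1 * rng.random(30),
    rng.random(30),
])
eigvals, scores = pca_retain_large_eigenvalues(X_demo)
```

Because the components come from the correlation matrix, the eigenvalues sum to the number of input indices, so an eigenvalue greater than one marks a component that explains more variance than any single standardized index.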

2.4.2. Similarity Measures. Intermolecular similarity was measured using two distinct methods. The AP method uses an associative measure described by Carhart et al. [3] and is based on atom pair descriptors. The measure is the ratio of the number of atom pairs shared by two molecules to the total number of atom pairs present in the two molecules. Similarity (S) between molecules i and j is defined as:

(2.8)  S_{ij} = 2C / (T_i + T_j)

where C is the number of atom pairs common to molecules i and j, and T_i and T_j are the total numbers of atom pairs in molecules i and j, respectively. The numerator is multiplied by a factor of 2 to reflect the presence of shared atom pairs in both compounds.
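The atom pair similarity of Eq. (2.8) can be sketched in Python as follows. The sketch assumes that each molecule is represented by a list of atom pair labels and that repeated atom pairs are counted with multiplicity, so that C is the size of the multiset intersection; that counting convention and the placeholder labels "A", "B", "C" are assumptions of the example, not output of APProbe.

```python
from collections import Counter

def ap_similarity(pairs_i, pairs_j):
    """S_ij = 2C / (T_i + T_j) for two lists of atom pair labels."""
    counts_i, counts_j = Counter(pairs_i), Counter(pairs_j)
    common = sum((counts_i & counts_j).values())   # C: shared pairs (multiset minimum)
    total = sum(counts_i.values()) + sum(counts_j.values())  # T_i + T_j
    return 2.0 * common / total if total else 0.0

# Placeholder labels standing in for real atom pair descriptors.
mol_1 = ["A", "A", "B"]
mol_2 = ["A", "B", "C"]
s_12 = ap_similarity(mol_1, mol_2)
```

In this toy case two pairs are shared (one "A" and one "B") out of six in total, giving S = 4/6; identical lists give S = 1 and lists with nothing in common give S = 0.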

The second similarity method used the Euclidean distance (ED) within an n-dimensional PC space derived from the TIs. ED between molecules i and j is defined as:

(2.9)  ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2}

where n equals the number of dimensions or PCs retained from the PCA, and D_ik and D_jk are the data values of the kth dimension for molecules i and j, respectively.
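A short Python sketch of the Euclidean distance and of ranking candidate analogs by it is given below. The two-dimensional PC scores and molecule names are made up for illustration, and the helper names are not taken from the original software.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences over the retained PCs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbors(scores, probe, k):
    """Names of the k molecules closest to the probe (the probe itself excluded)."""
    others = [name for name in scores if name != probe]
    others.sort(key=lambda name: euclidean_distance(scores[probe], scores[name]))
    return others[:k]

# Hypothetical PC scores for four molecules in a 2-dimensional PC space.
pc_scores = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (1.0, 0.0), "D": (0.0, 2.0)}
analogs = nearest_neighbors(pc_scores, "A", 2)
```

With these values the distances from A are 5, 1 and 2 for B, C and D, so the two nearest neighbors of A are C and D, in that order.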

2.4.3. Analog / K-Nearest Neighbor Selection. Following the quantification of intermolecular similarity of the molecules, analogs or nearest neighbors are determined on the basis of both S and ED. In the case of the AP method, two molecules are considered identical if S = 1, while they have no atom pairs in common if S = 0. The ED method measures a distance between molecules; thus, the lower the value of ED, the greater the similarity between two molecules.

2.4.4. Property Estimation. Since the data presented in the work of Yamaguchi et al. [27] represented mutagenicity as non-mutagen (-) or mutagen (+), these data were treated as a zero-one relationship, where non-mutagens have a value of zero and mutagens have a value of one. In estimating the mutagenicity of the probe compound, the mean of the observed mutagenicity of the K-nearest neighbors was used as the estimate. Thus, if the mean resulted in a value greater than 0.5, the compound was classified as a mutagen. However, if the mean was equal to 0.5, the compound was not classified, as the results were inconclusive.
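The classification rule can be written as a few lines of Python. The text states the outcomes for a mean above 0.5 (mutagen) and equal to 0.5 (unclassified); treating a mean below 0.5 as a non-mutagen call is the natural complement and is labeled here as an assumption. The function name is illustrative.

```python
def knn_estimate(neighbor_labels):
    """Classify a probe from the 0/1 mutagenicity labels of its K nearest neighbors.

    Returns 1 (mutagen), 0 (non-mutagen, assumed for mean < 0.5),
    or None when the mean is exactly 0.5 and the result is inconclusive.
    """
    mean = sum(neighbor_labels) / len(neighbor_labels)
    if mean > 0.5:
        return 1
    if mean < 0.5:
        return 0
    return None

print(knn_estimate([1, 1, 0]))   # K = 3, two mutagenic neighbors
print(knn_estimate([1, 0]))      # K = 2, tie
```

The tie case can only occur for even K, which is why an even number of neighbors leaves some compounds unclassified.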

3.  Results 

3.1. Principal Component Analysis. From the PCA of the 101 TIs, eight PCs with eigenvalues greater than one were retained. These eight PCs explained, cumulatively, 95.2% of the total variance within the TI data. Table 3 lists the eigenvalues of the eight PCs, the proportion of variance explained by each PC, the cumulative variance explained, and the two TIs most correlated with each individual PC.


Table 3. Eigenvalues, variance explained and two TIs most correlated with the eight principal components

PC    Eigenvalue   Percent variance   Cumulative variance   First most correlated TI              Second most correlated TI
                   explained          explained
PC1   55.52        54.97              54.97                 [symbol illegible] (96.5%)            ^3\chi (96.4%)
PC2   12.38        12.26              67.23                 SIC_3 (86.4%)                         SIC_4 (85.5%)
PC3   11.73        11.61              78.84                 \chi_Ch [indices illegible] (77.3%)   \chi_Ch [indices illegible] (76.1%)
PC4   6.78         6.71               85.55                 IC_0 (55.0%)                          \chi_Ch [indices illegible] (49.7%)
PC5   4.60         4.55               90.10                 J (68.9%)                             J^Y (62.4%)
PC6   2.35         2.32               92.43                 IC_0 (-47.2%)                         SIC_0 (-36.4%)
PC7   1.65         1.63               94.06                 \chi_C [indices illegible] (44.4%)    \chi_C [indices illegible] (43.5%)
PC8   1.16         1.14               95.21                 [symbol illegible] (-34.6%)           \chi_C [indices illegible] (23.0%)

3.2.  Analog  Selection.  Figure  3  shows  the  results  of  the  analog  selection  for 
isospectral  graph  10.1.1  using  atom  pairs  to  derive  a  similarity  space  and  PCs  to 
derive  a  Euclidean  distance  space.  The  first  five  analogs  (neighbors)  for  the  probe 
compound,  10.1.1,  are  presented  for  each  of  the  similarity  methods. 

3.3. K-Nearest Neighbor Estimation. Table 4 presents the results for the prediction of mutagenicity for the 113 molecules over a range of K values (K = 1-5) for both the AP and ED methods. The results are presented as percent correctly classified, and overall percent correct prediction rates are provided as a means of comparing the efficacy of the individual models. The variability between the K levels is easily explained by the problematic nature of using a binary relationship such as this one in estimation. When the number of neighbors was even, the potential for unclassified compounds led to lower prediction rates than in the case of an odd number of neighbors.


Figure 3. Analogs selected for isospectral graph 10.1.1 [structure drawings not reproduced; surviving panel labels: Euclidean distance method, S = 0.86, ED = 0.20]

Table 4. KNN results for the prediction of mutagenicity for 113 chemicals

      Percent Negative Correct   Percent Positive Correct   Total Percent Correct
K     AP      ED                 AP      ED                 AP      ED
1     73.5    75.0               84.1    66.7               77.7    71.7
2     66.2    64.7               72.7    33.3               68.8    52.2
3     77.9    80.9               88.6    53.3               82.1    69.9
4     70.6    69.1               77.3    42.2               73.2    58.4
5     79.4    77.9               86.4    53.3               82.1    68.1


4. Discussion

The major objective of this paper was to study the effectiveness of mathematical invariants in the characterization of molecular structure and the estimation of the toxicity of chemicals. An invariant maps a chemical structure into the set R of real numbers. A specific invariant may be used for the ordering or partial ordering of sets of molecules or in structure-activity relationship studies [45]. A particular structural invariant quantifies distinct aspects of molecular structure. Therefore, a combination of such indices might be more powerful in the mathematical characterization of molecular structure as compared to the use of one specific invariant. The problem arises out of the fact that often the various graph theoretic indices of molecular structures are strongly correlated. We have attempted to resolve this problem through the implementation of a PCA to derive orthogonal variables from a large set of calculated TIs, and using the orthogonal parameters in the characterization of structure [10, 12, 15, 17, 18, 22, 23].

In the present study we have used calculated atom pairs and principal components derived from TIs to select structural analogs for a probe compound from a diverse set containing closely related structures. The result of this analog selection, depicted in Figure 3, shows that each of the methods, judged by the five neighbors it selected, exhibits sufficient power to reject dissimilar structures. In other words, we may conclude that both the atom pair and Euclidean distance methods are capable of choosing similar molecules from a collection of structurally diverse structures. This is in line with our earlier studies with various diverse sets of molecules [10, 12, 15, 17, 18, 22, 23].

The central paradigm of QSAR holds that similar structures usually have similar properties. To test this idea, we selected K-nearest neighbors (K = 1-5) for each molecule from a set of 113 mutagens and non-mutagens using the ED and AP methods and used the selected nearest neighbors in estimating mutagenicity. The results in Table 4 show that both methods lead to reasonably good estimates, although the AP method was superior to the ED method.

In conclusion, both the ED and AP methods, based on calculated graph theoretic structural invariants, did reasonably well in the selection of structural analogs and in the estimation of chemical properties based on nearest neighbors.

Hierarchical quantitative structure-activity relationships (H-QSAR) have been developed as a new approach in constructing models for estimating physicochemical, biomedicinal, and toxicological properties of interest. This approach uses increasingly more complex molecular descriptors in a graduated approach to model building. In this study, statistical and neural network methods have been applied to the development of H-QSAR models for estimating the acute aquatic toxicity (LC50) of 69 benzene derivatives to Pimephales promelas (fathead minnow). Topostructural, topochemical, geometrical, and quantum chemical indices were used as the four levels of the hierarchical method. It is clear from both the statistical and neural network models that topostructural indices alone cannot adequately model this set of congeneric chemicals. Not surprisingly, topochemical indices greatly increase the predictive power of both statistical and neural network models. Quantum chemical indices also add significantly to the modeling of this set of acute aquatic toxicity data.


1.  INTRODUCTION 

An important aspect of modern toxicology research is the prediction of toxicity of xenobiotics and environmental pollutants from their molecular structure.1-13 The potential toxicity of a chemical is normally assessed on the basis of a wide variety of relevant physical and biological properties. Table 1 provides a partial list of such properties. Risk assessors use these kinds of toxicological indicators to estimate the potential risk posed by a given compound, using simpler properties relevant to a chemical's toxicity to make more complex assessments relevant to human and environmental health. However, the Toxic Substances Control Act (TSCA) Inventory currently includes about 80 000 chemicals, most of which do not have data for the toxicologically relevant properties mentioned in Table 1. In fact, roughly 50% of these chemicals do not have any experimental property data at all.14 Worldwide, more than 16.7 million distinct organic and inorganic chemicals are known, as is evident from the number of entries in the Chemical Abstracts Service (CAS) inventory.15 For many of these chemicals we do not have the data necessary for risk assessment. Additionally, modern combinatorial chemistry techniques have led to the production of vast libraries of chemicals at a very rapid rate. Most of these substances have none of the test data needed for their hazard estimation.

Table 1. Physicochemical and Biological Properties Relevant to the Assessment of Toxicity

physicochemical                  biological
molar volume                     receptor binding (K_D)
boiling point                    Michaelis constant (K_m)
melting point                    inhibitor constant (K_i)
vapor pressure                   biodegradation
aqueous solubility               bioconcentration
dissociation constant (pK_a)     alkylation profile
partition coefficient            metabolic profile
  octanol-water (log P)          chronic toxicity
  air-water                      carcinogenicity
  sediment-water                 mutagenicity
reactivity (electrophile)        acute toxicity
                                   LD50
                                   LC50

Recently there have been efforts by the chemical industry and government agencies to develop reliable databases of properties that will be used for hazard estimation.16 This effort, although commendable, falls short of the need; and the picture will remain so in the foreseeable future. In the area of molecular biology, innovative techniques are emerging where specially engineered cell lines can be used to detect the activity or toxicity of chemicals to the genetic system.17-19 Effects of chemicals on the pattern of cellular proteins, analyzed by proteomics technology, are being used to detect their potential toxic effects.20-22 Such methods are faster than the traditional in vivo test methods, and it is possible that they could be developed to the point where they will replace or significantly decrease the need for whole-animal screening methods. At present, neither the available test data nor the combination of in vitro toxicity testing methods provides adequate resources for hazard assessment.

Quantitative structure-activity/-toxicity relationship (QSAR/QSTR) models have emerged as useful tools to handle the data gap in toxicology and pharmacology.1-13,22-26 QSAR models can be used to estimate complex properties of chemicals from simpler experimental or computed properties. In view of the fact that most chemicals in commerce and environmental pollutants have very little test data, it would be desirable if we could develop toxicologically relevant QSARs from properties that can be calculated directly from a chemical's structure. In some of our recent papers we have developed a novel hierarchical QSAR (H-QSAR) approach where four classes of theoretical molecular descriptors, viz., topostructural, topochemical, geometrical, and quantum chemical parameters, have been used sequentially in the formulation of H-QSAR models for predicting
physical,  biomedicinal,  and  toxicological  properties.1 3*6’8,23-26 

Most of our H-QSARs are based on linear statistical methods such as multiple linear regression, principal components analysis (PCA), and variable clustering. Such methods yield useful models, but they suffer from the limitation that in some cases the relationship between a molecular descriptor and toxicity may be intrinsically nonlinear. In such cases, the use of linear statistical methods may not result in the best models. Therefore, in this paper, we have carried out a comparative study of multiple regression vis-a-vis neural net methods in predicting the acute aquatic toxicity (LC50) of a set of 69 benzene derivatives.

2.  METHODS 

2.1. Toxicity Database. The utility of this approach of generating numerous hierarchical theoretical descriptors of compounds was tested on a set of acute aquatic toxicity (LC50) data for 69 benzene derivatives. The data were taken from a study by Hall et al.,12 who collected acute aquatic toxicity data measured in fathead minnow (Pimephales promelas). These data were compiled from eight other literature sources and included some original work which was conducted at the U.S. Environmental Protection Agency Environmental Research Laboratory (USEPA-ERL) in Duluth, MN. This set of chemicals was composed of benzene and 68 substituted benzene derivatives. According to the authors, these benzene derivatives were tested using methodologies comparable to their own 96-h fathead minnow toxicity test system. The derivatives chosen for this study (see Table 2) have seven different substituent groups that are present in at least six of the molecules: chloro-, bromo-, nitro-, methyl-, methoxyl-, hydroxyl-, and amino-.

2.2. Calculation of Topological Indices. The complete set of topological indices (TIs) used in this study, both topostructural and topochemical, have been calculated using POLLY 2.3 and other software developed by Basak et al.27 These indices include the Wiener index,28 the connectivity indices developed by Randić,29 higher order connectivity indices formulated by Kier and Hall,30 bonding connectivity indices defined by Basak et al.,31 a set of information theoretic indices defined on the distance matrices of simple molecular graphs,32,33 a set of parameters derived on the neighborhood complexity of hydrogen-filled molecular graphs,34-36 and Balaban's J indices.37-39 Table 3 provides the symbols of the topological indices and brief definitions.

The  set  of  TIs  was  divided  into  two  distinct  subsets: 
topostructural  indices  (TSI)  and  topochemical  indices  (TCI). 
TSIs  are  topological  indices  which  encode  information  about 
the  adjacency  and  distances  of  atoms  (vertices)  in  molecular 
structures  (graphs)  irrespective  of  the  chemical  nature  of  the 
atoms  involved  in  the  bonding  or  factors  such  as  hybridiza- 


Table 2. Experimental and Estimated Acute Aquatic Toxicity Data for 69 Benzene Derivatives, Expressed as -log(LC50) for the Linear Regression Model (LR) and the Neural Network Model (NN) Using the 23 Parameters Selected by Variable Clustering

compound                   expt   LR     NN
benzene                    3.40   3.42   3.65
bromobenzene               3.89   3.77   3.79
chlorobenzene              3.77   3.75   3.77
phenol                     3.51   3.38   3.51
toluene                    3.32   3.66   3.62
1,2-dichlorobenzene        4.40   4.29   4.30
1,3-dichlorobenzene        4.30   4.37   4.12
1,4-dichlorobenzene        4.62   4.51   4.27
2-chlorophenol             4.02   3.79   3.91
3-chlorotoluene            3.84   3.88   3.79
4-chlorotoluene            4.33   3.87   3.76
1,3-dihydroxybenzene       3.04   3.43   3.53
3-hydroxyanisole           3.21   3.33   3.45
2-methylphenol             3.77   3.64   3.67
3-methylphenol             3.29   3.60   3.58
4-methylphenol             3.58   3.53   3.55
4-nitrophenol              3.36   3.61   3.76
1,4-dimethoxybenzene       3.07   3.28   3.51
1,2-dimethylbenzene        3.48   3.93   3.91
1,4-dimethylbenzene        4.21   3.87   3.68
2-nitrotoluene             3.57   3.66   3.81
3-nitrotoluene             3.63   3.53   3.71
4-nitrotoluene             3.76   3.49   3.68
1,2-dinitrobenzene         5.45   5.24   4.99
1,3-dinitrobenzene         4.38   4.18   4.19
1,4-dinitrobenzene         5.22   4.94   4.85
2-methyl-3-nitroaniline    3.48   3.79   3.88
2-methyl-4-nitroaniline    3.24   3.51   3.75
2-methyl-5-nitroaniline    3.35   3.68   3.86
2-methyl-6-nitroaniline    3.80   3.84   3.79
3-methyl-6-nitroaniline    3.80   3.78   3.62
4-methyl-2-nitroaniline    3.79   3.80   3.66
4-hydroxy-3-nitroaniline   3.65   3.61   3.58
4-methyl-3-nitroaniline    3.77   3.73   3.72
1,2,3-trichlorobenzene     4.89   4.89   5.04
1,2,4-trichlorobenzene     5.00   5.04   4.83
1,3,5-trichlorobenzene     4.74   5.11   4.78
2,4-dichlorophenol         4.30   4.33   4.47
3,4-dichlorotoluene        4.74   4.26   4.28
2,4-dichlorotoluene        4.54   4.36   4.44
4-chloro-3-methylphenol    4.27   3.87   4.07
2,4-dimethylphenol         3.86   3.76   3.72
2,6-dimethylphenol         3.75   3.80   3.84
3,4-dimethylphenol         3.90   3.80   3.79
2,4-dinitrophenol          4.04   4.14   4.01

1 ,2,4-trimethylbenzene 

4.21 

4.09 

3.87 

2,3-dinitrotoluene 

5.01 

5.20 

5.28 

2,4-dinitrotoIuene 

3.75 

4.10 

4.33 

2,5-dinitrotoluene 

5.15 

4.84 

4.72 

2,6-dinitrotoluene 

3.99 

4.41 

4.63 

3,4-dinitrotoIuene 

5.08 

5.11 

5.09 

3,5-dinitrotoluene 

3.91 

4.05 

4.16 

1 ,3,5-trinitrobenzene 

5.29 

5.37 

5.32 

2-methyl-3,5-dinitroaniline 

4.12 

4.13 

4.23 

2-methyl-3,6-dinitroaniline 

5.34 

4.80 

4.54 

3-methyi-2,4-dinitroaniline 

4.26 

4.28 

4.20 

5 -methyl-2, 4-dinitroaniline 

4.92 

4.14 

4.02 

4-methyl-2,6-dinitroaniline 

4.21 

4.67 

4.58 

5-methyl-2,6-dinitroaniline 

4.18 

4.80 

4.78 

4-methyl-3,5-dinitroaniline 

4.46 

4.34 

4.43 

2,4,6-tribromophenol 

4.70 

4.89 

5.47 

1 ,2,3,4-tetrachlorobenzene 

5.43 

5.62 

5.56 

1 ,2,4,5-tetrachlorobenzene 

5.85 

5.80 

5.61 

2,4,6-trichlorophenol 

4.33 

4.79 

4.96 

2-methyl-4,6-dinitrophenoI 

5.00 

4.21 

4.16 

2,3,6-trinitrotoluene 

6.37 

6.36 

5.81 

2,4,6-trinitrotoluene 

4.88 

5.16 

5.42 

2,3,4,5-tetrachlorophenol 

5.72 

5.36 

5.58 

2,3,4,5,6-pentachlorophenol 

6.06 

6.03 

5.83 

Table 3. Symbols, Definitions, and Classifications of Topological, Geometrical, and Quantum Chemical Parameters

Topostructural

I_D^W          information index for the magnitudes of distances between all possible pairs of vertexes of a graph
\bar{I}_D^W    mean information index for the magnitude of distance
W              Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D            degree complexity
H^V            graph vertex complexity
H^D            graph distance complexity
\bar{IC}       information content of the distance matrix partitioned by frequency of occurrences of distance h
O              order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1            a Zagreb group parameter = sum of square of degree over all vertexes
M_2            a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertexes
^hχ            path connectivity index of order h = 0-6
^hχ_C          cluster connectivity index of order h = 3, 5
^hχ_Ch         chain connectivity index of order h = 6
^hχ_PC         path-cluster connectivity index of order h = 4-6
P_h            no. of paths of length h = 0-10
J              Balaban's J index based on distance

Topochemical

I_ORB          information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertexes
IC_r           mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertexes in a hydrogen-filled graph
SIC_r          structural information content for rth (r = 0-6) order neighborhood of vertexes in a hydrogen-filled graph
CIC_r          complementary information content for rth (r = 0-6) order neighborhood of vertexes in a hydrogen-filled graph
^hχ^b          bond path connectivity index of order h = 0-6
^hχ^b_C        bond cluster connectivity index of order h = 3, 5
^hχ^b_Ch       bond chain connectivity index of order h = 6
^hχ^b_PC       bond path-cluster connectivity index of order h = 4-6
^hχ^v          valence path connectivity index of order h = 0-6
^hχ^v_C        valence cluster connectivity index of order h = 3, 5
^hχ^v_Ch       valence chain connectivity index of order h = 6
^hχ^v_PC       valence path-cluster connectivity index of order h = 4-6
J^B            Balaban's J index based on bond types
J^X            Balaban's J index based on relative electronegativities
J^Y            Balaban's J index based on relative covalent radii

Geometrical

V_W            van der Waals volume
3DW            3D Wiener no. for the hydrogen-suppressed geometric distance matrix
3DW_H          3D Wiener no. for the hydrogen-filled geometric distance matrix

Quantum Chemical

E_HOMO         energy of the highest occupied molecular orbital
E_HOMO1        energy of the second highest occupied molecular orbital
E_LUMO         energy of the lowest unoccupied molecular orbital
E_LUMO1        energy of the second lowest unoccupied molecular orbital
ΔH_f           heat of formation
μ              dipole moment

tion  states  of  atoms  and  number  of  core/valence  electrons 
in  individual  atoms.  TCIs  are  parameters  that  quantify 
information  regarding  the  topology  (connectivity  of  atoms), 
as  well  as  specific  chemical  properties  of  the  atoms  and 
bonds  comprising  a  molecule.  TCIs  are  derived  from 
weighted  molecular  graphs  where  each  vertex  (atom)  is 
properly  weighted  with  relevant  chemical/physical  properties. 
Table  3  shows  the  division  of  the  topological  indices  into 
topostructural  and  topochemical  indices. 

2.3. Calculation of Geometrical Indices. The geometrical
indices include the three-dimensional (3D) Wiener numbers
for hydrogen-filled and hydrogen-suppressed molecular
structures and van der Waals volume. Van der Waals volume,
V_W, was calculated using SYBYL 6.4 from Tripos Associates,
Inc.40 The 3D Wiener numbers were calculated by
SYBYL using an SPL (Sybyl Programming Language)
program developed in our laboratory. The 3D Wiener number
is the sum of the entries in the upper
triangular submatrix of the topographic Euclidean distance
matrix for a molecule. The 3D coordinates for the atoms
were determined using CONCORD 3.2.1.41 The symbols and
definitions of the geometrical indices are included in Table
3.
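
The upper-triangular sum described above can be sketched in a few lines. This is an illustrative Python version, not the SPL program used in the study; the function name and the idealized planar ring geometry (C-C distance of 1.40 Å, hydrogens omitted) are assumptions made for the example rather than CONCORD output.

```python
import math


def wiener_3d(coords):
    """Sum of Euclidean distances over all atom pairs i < j.

    `coords` is a list of (x, y, z) tuples; summing over i < j is the same
    as summing the upper triangular part of the geometric distance matrix.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(coords[i], coords[j])
    return total


# Hypothetical hydrogen-suppressed benzene: regular hexagon, radius 1.40 Å.
RADIUS = 1.40
ring = [
    (RADIUS * math.cos(math.radians(60 * k)),
     RADIUS * math.sin(math.radians(60 * k)),
     0.0)
    for k in range(6)
]

print(round(wiener_3d(ring), 3))
```

Passing coordinates that include the hydrogen atoms would give the hydrogen-filled counterpart of the same quantity.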


2.4. Quantum Chemical Parameters. Quantum chemical
parameters were calculated using the Austin Model version
one (AM1) semiempirical Hamiltonian. These parameters
were calculated using MOPAC 6.00 in the SYBYL interface.42
Brief definitions and symbols for the quantum
chemical parameters used in this study are included in Table
3.

2.5. Statistical Analysis and Hierarchical QSAR. Initially,
all topological indices were transformed by the natural
logarithm  of  the  index  plus  one.  This  was  done  to  scale  the 
indices,  since  some  may  be  several  orders  of  magnitude 
greater  than  others,  while  other  indices  may  equal  zero.  The 
geometric  indices  were  transformed  by  the  natural  logarithm 
of  the  index  for  consistency;  the  addition  of  one  was 
unnecessary. 
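
A minimal sketch of the two transformations just described is given below. The function names are hypothetical and the code is only meant to make the scaling explicit; it assumes, as the text implies, that topological indices are nonnegative and geometrical indices are strictly positive.

```python
import math


def scale_topological(value):
    """ln(index + 1): defined at zero and compresses large magnitudes."""
    return math.log(value + 1.0)


def scale_geometrical(value):
    """ln(index): no offset is needed because the value is never zero."""
    return math.log(value)


# Illustrative values only: an index equal to zero stays at zero, while
# indices that differ by orders of magnitude end up on a comparable scale.
for raw in (0.0, 9.0, 999.0):
    print(raw, scale_topological(raw))
```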

The  set  of  86  topological  indices  was  then  partitioned  into 
the two distinct sets: topostructural indices (35) and topochemical
indices (51). The sets of topostructural and
topochemical  indices  were  then  divided  into  subsets,  or 
clusters,  based  on  the  correlation  matrix  using  the  SAS 
variable  clustering  procedure  (VARCLUS)43  to  further  reduce 
the  number  of  independent  variables  for  use  in  model 
construction.  This  procedure  divides  the  set  of  indices  into 


888  J.  Chem.  Inf.  Comput.  Sci.,  Vol.  40,  No.  4,  2000 

disjoint clusters, such that each cluster is essentially unidimensional.

From each cluster, the index most correlated with the
cluster was selected for modeling, as well as any indices
that were poorly correlated with their cluster (R^2 < 0.70).
These indices were then used in the modeling of the acute
aquatic toxicity of benzene derivatives in fathead minnow.
The variable clustering and selection of indices was performed
independently for both the topostructural and topochemical
indices. This procedure resulted in a set of five
topostructural indices and a set of nine topochemical indices.

Reducing  the  number  of  independent  variables  is  critical 
when  attempting  to  model  small  data  sets  using  linear 
statistical  methods.  The  smaller  the  data  set,  the  greater  the 
chance of spurious error when using a large number of
independent  variables  (descriptors).  A  study  by  Topliss  and 
Edwards44  has  shown  that  for  a  set  with  about  70  dependent 
variables  (observations),  no  more  than  40  independent 
variables  may  be  used  while  keeping  the  probability  of 
chance  correlations  below  1%.  This  number  is  dependent 
on  the  actual  correlation  achieved  in  the  modeling  process; 
higher  correlation  results  in  a  better  chance  of  using  more 
variables  with  the  same  limited  probability  of  chance 
correlations.  In  this  study  we  are  well  below  the  cutoff  of 
40  independent  variables.  In  fact,  the  total  number  of 
descriptors  which  will  be  used  for  model  construction  and 
estimation  is  23,  well  within  the  bounds  of  the  Topliss  and 
Edwards  criteria.44 

Regression  modeling  was  accomplished  using  the  SAS 
procedure  REG43  on  four  distinct  sets  of  indices.  These  sets 
were  constructed  as  part  of  a  hierarchical  approach  to  QSAR 
model  development.  The  hierarchy  begins  with  the  simplest 
parameters,  the  TSIs.  After  using  the  TSIs  to  model  the 
activity, the next level of parameters is added. To the indices
included  in  the  best  TSI  model,  we  add  all  of  the  TCIs  and 
proceed  to  model  the  activity  using  these  parameters. 
Likewise,  the  indices  included  in  the  best  model  from  this 
procedure  are  combined  with  the  indices  from  the  next 
complexity  level,  the  geometrical  indices,  and  modeling  is 
conducted once again. Finally, the best model utilizing TSIs,
TCIs,  and  geometrical  indices  is  combined  with  the  quantum 
chemical  parameters  to  develop  the  final  model  in  the 
hierarchy. 

Additionally, the entire set of 95 descriptors (topostructural,
topochemical, geometrical, and quantum chemical) was
subjected  to  the  variable  clustering  procedure  and  a  reduced 
set  of  independent  variables  was  used  in  constructing  a 
QSAR  model.  This  varies  from  the  other  approach  in  that 
the  indices  were  clustered  as  one  set,  rather  than  as  four 
distinct  sets,  and  resulted  in  a  somewhat  different  set  of 
variables.  This  was  done  to  determine  if  there  is  any 
advantage  in  final  model  predictive  power  between  model 
development  based  on  the  H-QSAR  approach  versus  the 
“kitchen  sink”  approach,  i.e.,  using  the  entire  descriptor  set 
in  order  to  find  the  “best”  model. 

2.6.  Neural  Network  Methods.  Using  neural  networks, 
we  studied  two  classes  of  approaches  for  modeling  toxicity: 
(1)  giving  all  the  descriptors  to  a  learning  algorithm  (neural 
network  in  this  case)  and  (2)  reducing  the  feature  set  before 
giving  the  (reduced)  feature  set  to  a  learning  algorithm. 
Results for our approaches are from leave-one-out experiments
(i.e., 69 training/test set partitions). Leave-one-out
works by leaving one data point out of the training set and
giving  the  remaining  instances  (68  in  this  case)  to  the  learning 
algorithms  for  training.  This  process  is  repeated  69  times  so 
that  each  example  is  a  part  of  the  test  set  once  and  only 
once.  Leave-one-out  tests  generalization  accuracy  of  a 
learner,  whereas  training  set  accuracy  tests  only  the  learner’s 
ability  to  memorize.  Generalization  error  from  the  test  set  is 
the  true  test  of  accuracy  and  is  what  we  report  here. 
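
The leave-one-out procedure described above can be sketched as follows. For brevity the learner here is a one-descriptor least-squares line rather than a neural network, the data are made-up toy values, and the cross-validated R^2 is taken as 1 - PRESS/SS_total, which is a common definition but an assumption on our part since the formula is not stated in the text. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(xs, ys):
    """Predict each point from a model trained on all the other points."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


def cross_validated_r2(ys, predictions):
    """1 - PRESS / SS_total (assumed definition of the cross-validated R^2)."""
    mean_y = sum(ys) / len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Toy data (not taken from Table 2): one descriptor, five "compounds".
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0]
toxicity = [3.1, 3.4, 3.9, 4.2, 4.8]
held_out = leave_one_out(descriptor, toxicity)
print(cross_validated_r2(toxicity, held_out))
```

With 69 compounds the loop runs 69 times, and every prediction entering the statistic comes from a model that never saw the compound being predicted.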

First  we  trained  neural  networks  using  all  95  parameters: 
35  TSI,  51  TCI,  3  geometrical,  and  6  quantum  chemical 
parameters.  The  networks  contained  15  hidden  units  and  were 
trained for 1000 epochs. Each input parameter was normalized
to a value between 0 and 1 before training. Additional
parameter settings for the neural networks included a learning
rate of 0.05, a momentum term of 0.1, and weights initialized
randomly between -0.25 and +0.25.
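
The input scaling and weight initialization described above might look like the following sketch. Min-max scaling is assumed as the normalization method (the text states only the 0 to 1 range), the handling of a constant column is our own choice, and the function names and fixed random seed are hypothetical.

```python
import random


def min_max_normalize(column):
    """Rescale one descriptor column to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant descriptor carries no information, map to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def init_weights(n_inputs, n_hidden, rng, bound=0.25):
    """Input-to-hidden weights drawn uniformly from [-bound, +bound]."""
    return [
        [rng.uniform(-bound, bound) for _ in range(n_inputs)]
        for _ in range(n_hidden)
    ]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
hidden_weights = init_weights(n_inputs=95, n_hidden=15, rng=rng)
print(min_max_normalize([2.0, 4.0, 6.0]))
```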

For  our  next  experiment,  we  used  a  smaller  set  of  23 
independent  variables  divided  further  into  the  four  levels  of 
the  hierarchy.  The  23  independent  variables  included  the  5 
topostructural  and  9  topochemical  parameters  provided  by 
the  variable  clustering  technique  (see  section  3.1  for  a  list 
of  the  indices)  combined  with  the  3  geometrical  and  6 
quantum  chemical  parameters  described  in  Table  3.  The 
parameter  settings  for  these  networks  were  the  same  as  the 
settings  for  the  other  neural  network  experiment  mentioned 
above. 

3.  RESULTS 

3.1.  Results  of  Statistical  Regression  Procedures.  The 
variable  clustering  of  the  topostructural  indices  resulted  in 
the  retention  of  five  indices:  Mu  IC,  O,  P9.  All-subsets 
regression resulted in the selection of a four-parameter model
to estimate -log(LC50) with an explained variance (R^2) of
45.3% and a standard error (s) of 0.58. While this is an
unsatisfactory model, the indices were retained and combined
with the topochemical indices in the second step of model
development. The second step combined the 4 indices used
in the first tier model with the 9 topochemical indices selected
in the variable clustering procedure: SIC_0, SIC_1, SIC_4, CIC_0,
Y> Yc, Yc, Yrc, J^X. Again, all-subsets regression was
conducted, resulting in a four-parameter model with an
explained variance (R^2) of 78.3% and a standard error (s) of
0.36. The 4 indices from the second tier model were
combined with the three geometric parameters: 3DW_H, 3DW,
V_W. This resulted in a four-parameter model that replaced
the topochemical index CIC_0 with the geometric parameter
3DW_H. This model had an explained variance (R^2) of 79.2%
and a standard error (s) of 0.36. The final step in the
hierarchical method combined the four parameters from the
third tier model with the semiempirical quantum chemical
parameters: E_HOMO, E_HOMO1, E_LUMO, E_LUMO1, ΔH_f, μ. This
set of 10 indices led to a seven-parameter model with an
explained variance (R^2) of 86.3% and a standard error (s) of
0.30. This model retained all indices from the third model
and added three of the AM1 quantum chemical parameters.
Our final model, using indices selected from a variable
clustering of the entire set of 95 indices, resulted in a
seven-parameter model including three topostructural indices
(^0χ, P_9, IC), one topochemical index (Y), one geometrical
index (3DW_H), and two quantum chemical descriptors
(ΔH_f, μ). This model had an explained variance (R^2) of 86.1%
and a standard error (s) of 0.30.

Table 4. Relative Effectiveness of Statistical and Neural Network
Methods in Estimating the Acute Aquatic Toxicity of 69 Benzene
Derivatives

                   neural networks      linear regression
model              R_c^2     s          R_c^2     s
TSI                0.299     0.63       0.366     0.629
+ TCI              0.619     0.47       0.754     0.392
+ 3D               0.656     0.44       0.763     0.384
+ QC               0.770     0.36       0.825     0.339
all 95 indices     0.758     0.37       0.827     0.337

Leave-one-out analysis was conducted on all models for
purposes of comparison with the results from the neural
networks. The resulting values for cross-validated R^2 (R_c^2)
and standard error (s) are reported in Table 4.

3.2. Results of the Neural Network Procedures. The first
approach, incorporating all 95 parameters, obtained a test-set
correlation coefficient between predicted toxicity and
measured toxicity (explained variance) of R^2 = 0.868 and a
standard error of 0.29. The second approach, which utilizes the
hierarchical method of grouping descriptors, resulted in four
models, one for each level of the hierarchy. The results from
the leave-one-out analysis of these four models, as well as
those for the linear statistical models, are summarized in Table
4. Table 2 presents the experimental acute aquatic toxicity
(-log[LC50]) values for the 69 benzene derivatives as well
as the values estimated by the best statistical model and the
best neural network model, both of which resulted from the
fourth H-QSAR model.

4.  DISCUSSION 

The  results  show  that  both  statistical  and  neural  network 
models  give  acceptable  estimates  for  the  toxicity  of  the  69 
benzene  derivatives  studied  in  this  paper.  As  can  be  clearly 
seen  from  the  comparative  results  in  Table  4,  there  are  two 
points  in  the  hierarchical  approach  in  which  there  are 
significant  improvements  in  modeling  the  data.  The  addition 
of  the  topochemical  indices  increases  the  variance  explained 
in both the statistical and neural network models by 30-40%,
with a consequent drop in the standard error of the
calculations  as  well.  Addition  of  the  quantum  chemical 
parameters  also  creates  a  significant  increase  in  the  efficacy 
of  both  models,  a  6.2%  increase  in  the  variance  explained 
for  the  statistical  model  and  an  11.4%  increase  for  the  neural 
network  model. 

It  is  interesting  to  note  that  the  neural  network  model  using 
the  subset  of  23  inputs  selected  in  part  by  the  VARCLUS 
procedure  gave  slightly  better  results  as  compared  to  the 
network  developed  using  all  95  input  variables.  This  could 
be  the  result  of  filtering  out  redundant,  or  nearly  redundant, 
parameters  from  the  set  of  independent  variables. 

Further  work  on  the  relative  utility  of  statistical  vis-a-vis 
neural  network  methods  is  necessary  to  determine  which 
types  of  models  are  best  suited  to  the  estimation  of  chemical 
toxicity. 

ACKNOWLEDGMENT 

This  paper  is  contribution  no.  270  from  the  Center  for 
Water  and  the  Environment  of  the  Natural  Resources 
Research  Institute.  Research  reported  in  this  paper  was 
supported, in part, by Grants F49620-94-1-0401 and F49620-96-1-0330
from the United States Air Force, Grant IRI-9734419
from the National Science Foundation, and a
MONTS grant from the University of Montana.

REFERENCES  AND  NOTES 

(1) Basak, S. C.; Gute, B. D.; Grunwald, G. D. In Topological Indices
and Related Descriptors in QSAR and QSPR; Devillers, J., Balaban,
A. T., Eds.; Gordon & Breach: Reading, U.K., 1999; pp 675-696.

(2) Basak, S. C. In Practical Applications of Quantitative Structure-
Activity Relationships (QSAR) in Environmental Chemistry and
Toxicology; Karcher, W., Devillers, J., Eds.; Kluwer Academic
Publishers: Dordrecht, The Netherlands, 1990; pp 83-103.

(3) Basak, S. C.; Gute, B. D.; Grunwald, G. D. In QSAR in Environmental
Sciences, Vol. 7; Chen, F., Schüürmann, G., Eds.; SETAC Press:
Pensacola, FL, 1998; pp 245-261.

(4) Basak, S. C.; Gute, B. D. In Discrete Mathematical Chemistry; Hansen,
P., Paradis, N., Eds.; DIMACS Series in Discrete Mathematics and
Theoretical Computer Science, Vol. 51; American Mathematical
Society: Providence, RI, 2000; pp 9-24.

(5) Basak, S. C.; Gute, B. D.; Grunwald, G. D. Assessment of the
mutagenicity of chemicals from theoretical structural parameters: A
hierarchical approach. SAR QSAR Environ. Res. 1999, 10, 117-129.

(6) Gute, B. D.; Grunwald, G. D.; Basak, S. C. Prediction of the dermal
penetration of polycyclic aromatic hydrocarbons (PAHs): A hierarchical
QSAR approach. SAR QSAR Environ. Res. 1999, 10, 1-15.

(7) Basak, S. C.; Gute, B. D. Characterization of molecular structures
using topological indices. SAR QSAR Environ. Res. 1997, 7, 1-21.

(8) Gute, B. D.; Basak, S. C. Predicting acute toxicity (LC50) of benzene
derivatives using theoretical molecular descriptors: A hierarchical
QSAR approach. SAR QSAR Environ. Res. 1997, 7, 117-131.

(9) Mushrush, G. W.; Basak, S. C.; Slone, J. E.; Beal, E. J.; Basu, S.;
Stalick, W. M.; Hardy, D. R. Computational study of the environmental
fate of selected aircraft deicing compounds. J. Environ. Sci. Health
1997, A32 (8), 2201-2211.

(10) Basak, S. C.; Grunwald, G. D. Predicting mutagenicity of chemicals
using topological and quantum chemical parameters: A similarity
based study. Chemosphere 1995, 31, 2529-2546.

(11) Basak, S. C.; Grunwald, G. D. In Proceeding of the XVI International
Cancer Congress; Rao, R. S., Deo, M. G., Sanghui, L. D., Eds.;
Monduzzi: Bologna, Italy, 1995; p 413.

(12) Hall, L.; Kier, L.; Phipps, G. Structure-activity relationship studies
on the toxicities of benzene derivatives: I. An additivity model.
Environ. Toxicol. Chem. 1984, 3, 355-365.

(13) Gombar, V. K.; Enslein, K.; Blake, B. W. Assessment of developmental
toxicity potential of chemicals by quantitative structure-toxicity
relationship models. Chemosphere 1995, 31, 2499-2510.

(14) Auer, C. M.; Nabholz, J. V.; Baetcke, K. P. Mode of action and the
assessment of chemical hazards in the presence of limited data: Use
of structure-activity relationships (SAR) under TSCA, Section 5.
Environ. Health Perspect. 1990, 87, 183-197.

(15) CAS. The latest CAS registry number and substance count,
http://www.cas.org/cgi-bin/regreport.pl, 2000.

(16) Johnson, J. Pact triggers tests: Thousands of chemicals may be tested
under toxicity screening program. Chem. Eng. News 1998, 76, 19-20.

(17) Chen, J. J.; Wu, R.; Yang, P. C.; Huang, J. Y.; Sher, Y. P.; Han, M.
H.; Kao, W. C.; Lee, P. J.; Chiu, T. F.; Chang, F.; Chu, Y. W.; Wu,
C. W.; Peck, K. Profiling expression patterns and isolating differentially
expressed genes by cDNA microarray system with colorimetry
detection. Genomics 1998, 51, 313-324.

(18) Schena, M.; Shalon, D.; Davis, R. W.; Brown, P. O. Quantitative
monitoring of gene expression patterns with a complementary DNA
microarray. Science 1995, 270, 467-470.

(19) De Risi, J.; Penland, L.; Brown, P. O.; Bittner, M. L.; Meltzer, P. S.;
Ray, M.; Chen, Y.; Su, Y. A.; Trent, J. M. Use of a cDNA microarray
to analyse gene expression patterns in human cancer. Nat. Genet. 1996,
14, 457-460.

(20) Witzmann, F. A.; Fultz, C. D.; Grant, R. A.; Wright, L. S.; Kornguth,
S. E.; Siegel, F. L. Differential expression of cytosolic proteins in the
rat kidney cortex and medulla: Preliminary proteomics. Electrophoresis
1998, 19, 2491-2497.

(21) Anderson, N. L.; Esquer-Blasco, R.; Richardson, F.; Foxworthy, P.;
Eacho, P. The effects of peroxisome proliferators on protein abundances
in mouse liver. Toxicol. Appl. Pharm. 1996, 137, 75-89.

(22) Lake, B. G.; Lewis, D. F. V.; Gray, T. J. B.; Beamand, J. A. Structure-
activity relationships for induction of peroxysomal enzyme activities
in primary rat hepatocyte cultures. Toxicol. in Vitro 1993, 7, 605-614.

(23) Basak, S. C.; Gute, B. D.; Grunwald, G. D.; Opitz, D. W.; Balasubramanian,
K. In Predictive Toxicology of Chemicals: Experiences and
Impact of AI Tools-Papers from the 1999 AAAI Symposium; AAAI
Press: Menlo Park, CA, 1999; pp 108-111.

(24) Basak, S. C.; Gute, B. D.; Ghalak, S. Prediction of complement-
inhibitory activity of benzamidines using topological and geometric
parameters. J. Chem. Inf. Comput. Sci. 1999, 39, 255-260.

(25) Basak, S. C.; Gute, B. D.; Grunwald, G. D. Use of topostructural,
topochemical, and geometric parameters in the prediction of vapor
pressure: A hierarchical QSAR approach. J. Chem. Inf. Comput. Sci.
1997, 37, 651-655.

(26) Basak, S. C.; Gute, B. D.; Grunwald, G. D. A comparative study of
topological and geometrical parameters in estimating normal boiling
point and octanol/water partition coefficient. J. Chem. Inf. Comput.
Sci. 1996, 36, 1054-1060.

(27) Basak, S.; Harriss, D.; Magnuson, V. POLLY 2.3; University of
Minnesota: Duluth, MN, 1988.

(28) Wiener, H. Structural determination of paraffin boiling points. J. Am.
Chem. Soc. 1947, 69, 17-20.

(29) Randić, M. On characterization of molecular branching. J. Am. Chem.
Soc. 1975, 97, 6609-6615.

(30) Kier, L.; Hall, L. Molecular Connectivity in Structure-Activity
Analysis; Research Studies Press: Hertfordshire, U.K., 1986.

(31) Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R. Determining
structural similarity of chemicals using graph-theoretic indices.
Discrete Appl. Math. 1988, 19, 17-44.

(32) Raychaudhury, C.; Ray, S. K.; Ghosh, J. J.; Roy, A. B.; Basak, S. C.
Discrimination of isomeric structures using information theoretic
topological indices. J. Comput. Chem. 1984, 5, 581-588.

(33) Bonchev, D.; Trinajstić, N. Information theory, distance matrix and
molecular branching. J. Chem. Phys. 1977, 67, 4517-4533.

(34) Basak, S. C.; Roy, A. B.; Ghosh, J. J. In Proceedings of the Second
International Conference on Mathematical Modelling; Avula, X. J.
R., Bellman, R., Luke, Y. L., Rigler, A. K., Eds.; University of
Missouri-Rolla: Rolla, MO, 1980; p 851.

(35) Roy, A. B.; Basak, S. C.; Harriss, D. K.; Magnuson, V. R. In
Mathematical Modelling in Science and Technology; Avula, X. J. R.,
Kalman, R. E., Lapis, A. I., Rodin, E. Y., Eds.; Pergamon Press: New
York, 1984; p 745.

(36) Basak, S. C.; Magnuson, V. R. Molecular topology and narcosis.
Arzneim.-Forsch./Drug Res. 1983, 33, 501-503.

(37) Balaban, A. T. Highly discriminating distance-based topological index.
Chem. Phys. Lett. 1982, 89, 399-404.

(38) Balaban, A. T. Topological indices based on topological distances in
molecular graphs. Pure Appl. Chem. 1983, 55, 199-206.

(39) Balaban, A. T. Chemical graphs. Part 48. Topological index J for
heteroatom-containing molecules taking into account periodicities of
element properties. Math. Chem. (MATCH) 1986, 21, 115-122.

(40) SYBYL Version 6.4; Tripos Associates, Inc.: St. Louis, MO, 1998.

(41) CONCORD Version 3.2.1; Tripos Associates, Inc.: St. Louis, MO,
1998.

(42) Stewart, J. J. P. MOPAC 6.00, QCPE #455; Frank J. Seiler Research
Laboratory, U.S. Air Force Academy: Colorado Springs, CO, 1990.

(43) SAS/STAT User's Guide, 6.03 ed.; SAS Institute Inc.: Cary, NC, 1988;
Chapters 28 and 34, pp 773-875, 949-965.




